Abstract

In this paper, we consider unregularized online learning algorithms in a Reproducing
Kernel Hilbert Space (RKHS). Firstly, we derive explicit convergence rates of the
unregularized online learning algorithms for classification associated with a general
$\alpha$-activating loss (see Definition 1 below). Our results extend and refine the results in [30]
for the least-square loss and the recent result [3] for the loss function with a Lipschitz-
continuous gradient. Moreover, we establish a very general condition on the step sizes
which guarantees the convergence of the last iterate of such algorithms. Secondly, we
establish, for the first time, the convergence of the unregularized pairwise learning
algorithm with a general loss function and derive explicit rates under the assumption of
polynomially decaying step sizes. Concrete examples are used to illustrate our main
results. The main techniques are tools from convex analysis, refined inequalities of
Gaussian averages [5], and an induction approach.

Keywords: Learning theory, Online learning, Reproducing kernel Hilbert space, Pairwise
learning, Bipartite ranking


1 Introduction 

Let the input space $\mathcal{X}$ be a complete metric space and the output space $\mathcal{Y} = \{\pm 1\}$. In the
standard framework of learning theory, one considers the problem of learning from
a set of examples $\mathbf{z} = \{z_i = (x_i, y_i) \in \mathcal{X} \times \mathcal{Y} : i = 1, 2, \ldots, T\}$ which are independently and
identically distributed (i.i.d.) according to an unknown distribution $\rho$ on $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$.

In the task of classification, a univariate loss function $\phi(yf(x))$ measures the error when $f(x)$
is used to predict the true label $y$. In this case, one aims to find a predictor in a hypothesis
space to minimize the following true (generalization) error which is defined, for a function
$g : \mathcal{X} \to \mathbb{R}$, by

\[ \mathcal{E}(g) = \int_{\mathcal{Z}} \phi(y g(x))\, d\rho(x,y). \]

In contrast to the task of classification, pairwise learning problems involve a pairwise loss
function $\phi((y - y')f(x,x'))$ for a hypothesis function $f : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. Notable examples of
pairwise learning tasks include bipartite ranking, similarity and metric learning
[26], AUC maximization [35] and gradient learning. The aim of pairwise
learning is to minimize the true error which is defined, for a pairwise function $f$,
by

\[ \mathcal{E}(f) = \iint_{\mathcal{Z}\times\mathcal{Z}} \phi((y - y')f(x, x'))\, d\rho(x, y)\, d\rho(x', y'). \]

In this paper, we consider online learning algorithms for both classification and pairwise
learning tasks in a Reproducing Kernel Hilbert Space (RKHS). Specifically, let $G : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$
be a Mercer kernel, i.e. a continuous, symmetric and positive semi-definite kernel, see
e.g. [23]. According to [2], the RKHS $\mathcal{H}_G$ associated with kernel $G$ is defined to be the
completion of the linear span of the set of functions $\{G_x(\cdot) := G(x,\cdot) : x \in \mathcal{X}\}$ with an inner
product satisfying the reproducing property, i.e., for any $x', x \in \mathcal{X}$, $\langle G_x, G_{x'}\rangle_G = G(x,x')$.
Similarly, for pairwise learning, we assume that the pairwise function $f : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is from
an RKHS $\mathcal{H}_K$ defined on the domain $\mathcal{X}^2 := \mathcal{X} \times \mathcal{X}$ with a (pairwise) kernel $K : \mathcal{X}^2 \times \mathcal{X}^2 \to \mathbb{R}$.
Throughout this paper, we consider a specific family of loss functions called $\alpha$-activating losses,
defined as follows.

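Definition 1. A function $\phi : \mathbb{R} \to \mathbb{R}_+$ is called an $\alpha$-activating loss with some $\alpha \in (0,1]$ if
it is convex and differentiable, $\phi'(0) < 0$, and $L := \sup_{s \ne \tilde{s} \in \mathbb{R}} |\phi'(s) - \phi'(\tilde{s})| / |s - \tilde{s}|^{\alpha} < \infty$.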

Our definition of $\alpha$-activating loss follows [28] where the concept of the activating loss was
first introduced. In-depth discussions of loss functions for classification can be found in the
literature. Typical examples of $\alpha$-activating losses include the $q$-norm loss [10, 33]
$\phi(s) = (1 - s)_+^q = \max\{1 - s, 0\}^q$ for the support vector machine (SVM) classification with $1 < q \le 2$, the least
square loss $\phi(s) = (1 - s)^2$ and the logistic regression loss $\phi(s) = \log(1 + e^{-s})$.
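The three example losses can be written down and checked numerically. The following is a minimal sketch, assuming a finite grid of test points; the function names and the grid are illustrative choices and not part of the paper. It evaluates the Hölder ratio $|\phi'(s) - \phi'(u)| / |s - u|^{\alpha}$ from Definition 1 over the grid, which gives a lower estimate of the constant $L$.

```python
import math

def q_norm_loss(s, q=1.5):
    """q-norm loss (1 - s)_+^q; it is (q - 1)-activating for 1 < q <= 2."""
    return max(1.0 - s, 0.0) ** q

def q_norm_grad(s, q=1.5):
    return -q * max(1.0 - s, 0.0) ** (q - 1.0)

def least_square_loss(s):
    return (1.0 - s) ** 2

def least_square_grad(s):
    return -2.0 * (1.0 - s)

def logistic_loss(s):
    """log(1 + exp(-s)), written to avoid overflow for large |s|."""
    if s >= 0:
        return math.log1p(math.exp(-s))
    return -s + math.log1p(math.exp(s))

def logistic_grad(s):
    """Derivative of the logistic loss, equal to -1 / (1 + exp(s))."""
    if s >= 0:
        e = math.exp(-s)
        return -e / (1.0 + e)
    return -1.0 / (1.0 + math.exp(s))

def holder_ratio(grad, alpha, grid):
    """Largest value of |grad(s) - grad(u)| / |s - u|^alpha over grid pairs."""
    best = 0.0
    for i, s in enumerate(grid):
        for u in grid[i + 1:]:
            best = max(best, abs(grad(s) - grad(u)) / abs(s - u) ** alpha)
    return best

# Illustrative grid on [-5, 5]; any finite grid only gives a lower estimate of L.
grid = [i / 10.0 for i in range(-50, 51)]
```

On this grid the ratio stays below 1/4 for the logistic loss with $\alpha = 1$, and it equals $q$ for the $q$-norm loss with $\alpha = q - 1$, in line with the claim that these losses are $\alpha$-activating.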

The first purpose of this paper is to study the unregularized online learning algorithm for
classification associated with a general $\alpha$-activating loss defined as follows.

Algorithm 1. Given the i.i.d. generated training data $\mathbf{z} = \{z_i = (x_i, y_i) : i = 1, 2, \ldots, T\}$,
the unregularized online learning algorithm is given by $g_1 = 0$ and, for any $1 \le t \le T$,

\[ g_{t+1} = g_t - \gamma_t \phi'(y_t g_t(x_t))\, y_t G_{x_t}, \tag{1.1} \]

where $\{\gamma_t > 0 : t \in \mathbb{N}\}$ is usually referred to as the step size.
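Since $g_1 = 0$ and each update (1.1) adds one multiple of $G_{x_t}$, the iterate $g_{t+1}$ can be stored as a kernel expansion over the points seen so far. The sketch below illustrates this under stated assumptions: a Gaussian kernel, the logistic loss, and step sizes $\gamma_t = c t^{-\theta}$. The names `run_algorithm_1`, `gaussian_kernel` and the default parameters are hypothetical choices for illustration only.

```python
import math

def gaussian_kernel(x, u, width=1.0):
    """Gaussian kernel G(x, u); an assumed Mercer kernel with G(x, x) = 1."""
    dist2 = sum((a - b) ** 2 for a, b in zip(x, u))
    return math.exp(-dist2 / (2.0 * width ** 2))

def logistic_grad(s):
    """Derivative of log(1 + exp(-s)), written to avoid overflow."""
    if s >= 0:
        e = math.exp(-s)
        return -e / (1.0 + e)
    return -1.0 / (1.0 + math.exp(s))

def run_algorithm_1(samples, loss_grad=logistic_grad, c=1.0, theta=0.75,
                    kernel=gaussian_kernel):
    """Return the last iterate g_{T+1} of update (1.1) as a callable.

    samples is a list of (x_t, y_t) with x_t a tuple of floats and y_t in {-1, +1}.
    """
    centers, coeffs = [], []  # g = sum_i coeffs[i] * G(centers[i], .)

    def g(x):
        return sum(a * kernel(xc, x) for a, xc in zip(coeffs, centers))

    for t, (x_t, y_t) in enumerate(samples, start=1):
        gamma_t = c * t ** (-theta)
        # Coefficient of G_{x_t} in g_{t+1} - g_t, evaluated with the current g_t.
        a_t = -gamma_t * loss_grad(y_t * g(x_t)) * y_t
        centers.append(x_t)
        coeffs.append(a_t)
    return g
```

For example, `run_algorithm_1([((2.0,), 1), ((-2.0,), -1)] * 10)` returns a function that is positive at `(2.0,)` and negative at `(-2.0,)`. Evaluating the expansion costs one kernel call per stored point, so a full pass over $T$ samples costs $O(T^2)$ kernel evaluations.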

Online learning algorithms for classification or regression have drawn much attention (see,
e.g., [32] and the references therein). Most of them focused on regularized online learning algorithms, i.e.
$g_{t+1} = g_t - \gamma_t(\phi'(y_t g_t(x_t))\, y_t G_{x_t} + \lambda g_t)$. In particular, regularized online learning with a fixed
$\lambda > 0$ was studied in [21] for the least-square loss and in [32] for the general loss function,
and in [24, 29] for a time-varying regularization, i.e. $\lambda = \lambda(t) > 0$.

Instead, we focus on deriving explicit convergence rates of the unregularized online learning
algorithms (i.e. $\lambda = 0$) with a general $\alpha$-activating loss. Our results extend and refine those
in [30] for the least-square loss and the recent result [3, Theorem 4] for the loss function
with a Lipschitz-continuous gradient. In contrast to those results, which were derived with the step
sizes chosen in the special form $O(t^{-\theta})$, we establish a very general condition
on the step sizes which guarantees the convergence of the last iterate $g_{T+1}$ of Algorithm 1.
Moreover, in contrast to the proof in [3], our new proof here is
much simpler and more powerful in handling general loss functions.

The second purpose of this paper is to study the convergence of the last iterate of the
following online pairwise learning algorithm, which is associated with an $\alpha$-activating loss
function and the RKHS $\mathcal{H}_K$.

Algorithm 2. Given the i.i.d. generated training data $\mathbf{z} = \{z_i = (x_i, y_i) : i = 1, 2, \ldots, T\}$,
the unregularized online pairwise learning algorithm is given by $f_1 = f_2 = 0$ and, for any
$2 \le t \le T$,

\[ f_{t+1} = f_t - \frac{\gamma_t}{t-1} \sum_{j=1}^{t-1} \phi'((y_t - y_j) f_t(x_t, x_j))\,(y_t - y_j)\, K_{(x_t, x_j)}. \tag{1.2} \]
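Update (1.2) pairs the newest example with every earlier one and averages the resulting gradients. The following is a minimal sketch under a simplifying assumption that is not in the paper: the pairwise kernel is taken to be the linear one, $K((x,x'),(u,u')) = \langle x - x', u - u' \rangle$, so that $f(x,x') = \langle w, x - x' \rangle$ and the iterate reduces to a weight vector $w$. The function name and default parameters are hypothetical.

```python
import math

def logistic_grad(s):
    """Derivative of log(1 + exp(-s)), written to avoid overflow."""
    if s >= 0:
        e = math.exp(-s)
        return -e / (1.0 + e)
    return -1.0 / (1.0 + math.exp(s))

def run_algorithm_2_linear(samples, loss_grad=logistic_grad, c=0.1, theta=0.75):
    """Last iterate of update (1.2) for the assumed linear pairwise kernel.

    samples is a list of (x_t, y_t) with x_t a tuple of floats and y_t in {-1, +1}.
    Returns w such that f_{T+1}(x, x') = <w, x - x'>.
    """
    dim = len(samples[0][0])
    w = [0.0] * dim  # f_1 = f_2 = 0
    for t in range(2, len(samples) + 1):
        x_t, y_t = samples[t - 1]
        gamma_t = c * t ** (-theta)
        step = [0.0] * dim
        for x_j, y_j in samples[:t - 1]:
            diff = [a - b for a, b in zip(x_t, x_j)]  # represents K_{(x_t, x_j)}
            margin = (y_t - y_j) * sum(wi * di for wi, di in zip(w, diff))
            coef = loss_grad(margin) * (y_t - y_j)
            step = [s + coef * d for s, d in zip(step, diff)]
        w = [wi - gamma_t / (t - 1) * si for wi, si in zip(w, step)]
    return w
```

Pairs with equal labels contribute nothing because of the factor $(y_t - y_j)$, so only mixed-label pairs drive the update. Step $t$ touches $t - 1$ earlier examples, which gives $O(T^2)$ inner products overall.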

Online pairwise learning involves non-i.i.d. pairs of examples, which introduces more
difficulty than the analysis in the univariate case. The research in this direction was recently
conducted in [27, 33]. In particular, in [27] the convergence of the average of the
iterates (i.e. $\frac{1}{T}\sum_{t=1}^{T} f_t$) was established in the linear case by following an online-to-batch
conversion approach similar to those in the univariate case. Recent work [33] focuses on
Algorithm 2 with the least-square loss. However, the analysis techniques there heavily
depend on the nature of the least-square loss (e.g. its derivative is a linear function) and do
not apply to the general loss function.

In this paper, we establish, for the first time, the convergence of the last iterate of the 
unregularized pairwise learning algorithm (Algorithm 2) with a general loss function and 
derive explicit rates under the assumption of polynomially decaying step sizes. Concrete 
examples are used to illustrate our main results. The main techniques are tools from convex 
analysis and refined inequalities related to the Gaussian averages [5].


2 Main Results 

In this section, we present our main results related to Algorithms 1 and 2. The following 
theorem states a general convergence result for Algorithm 1. 

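Theorem 1. Assume that $\phi$ is $\alpha$-activating with some $0 < \alpha \le 1$ and let $\{g_t : t = 1, \ldots, T + 1\}$
be given by Algorithm 1. If the step sizes satisfy $\sum_{t=1}^{\infty} \gamma_t^{1+\alpha} < \infty$, then
$\lim_{T\to\infty} \mathbb{E}[\mathcal{E}(g_{T+1})]$ exists. If, furthermore, $g_{\mathcal{H}} = \arg\inf_{g\in\mathcal{H}_G} \mathcal{E}(g)$ exists and $\sum_{t=1}^{\infty} \gamma_t = \infty$, then

\[ \lim_{T\to\infty} \mathbb{E}[\mathcal{E}(g_{T+1})] = \inf_{g\in\mathcal{H}_G} \mathcal{E}(g). \]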

By the above theorem, the step sizes can be chosen in the form of $\gamma_t = ct^{-\theta}$ with some
$\theta \in (\frac{1}{1+\alpha}, 1)$ and $c > 0$. Indeed, we can further derive the explicit convergence rate for the
last iterate of Algorithm 1.

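Theorem 2. Assume that $\phi$ is $\alpha$-activating with some $0 < \alpha \le 1$ and $g_{\mathcal{H}} = \arg\inf_{g\in\mathcal{H}_G} \mathcal{E}(g)$
exists. Choose step sizes $\gamma_t = ct^{-\theta}$ with some $\theta \in (\frac{1}{1+\alpha}, 1)$ and $c > 0$. Then,

\[ \mathbb{E}[\mathcal{E}(g_{T+1}) - \mathcal{E}(g_{\mathcal{H}})] \le C_{\theta,\alpha,\mathcal{H}}\, T^{-\min(\frac{\alpha\theta}{2},\, 1-\theta)}, \]

where the constant $C_{\theta,\alpha,\mathcal{H}}$ depends on $\theta, \alpha, c$ and $\|g_{\mathcal{H}}\|_G$ (see its explicit form in the proof).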
From the above theorem, the maximal rate for $\alpha$-activating losses is of the form $O(T^{-\frac{\alpha}{\alpha+2}})$,
which is achieved by choosing $\gamma_t = ct^{-\frac{2}{\alpha+2}}$. When $\alpha = 1$, the rate is $O(T^{-\frac{1}{3}})$, which
is consistent with that in [3]. We can directly get the following examples from the above
theorems, since $\phi(t) = (1 - t)_+^q$ with $q \in (1, 2]$ is a $(q-1)$-activating loss and $\phi(t) = \log(1 + e^{-t})$
is a 1-activating loss.
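The trade-off behind the maximal rate can be made explicit. The proof of Theorem 2 works with the exponent $\beta = \min(\frac{\alpha\theta}{2}, 1 - \theta)$, and the two terms balance at $\theta = \frac{2}{\alpha+2}$. The helper functions below are a small illustrative sketch (the names are not from the paper) that evaluates this exponent and the balancing choice of $\theta$.

```python
def rate_exponent(theta, alpha):
    """Exponent beta = min(alpha * theta / 2, 1 - theta) for step sizes c * t**(-theta)."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    if not 1.0 / (1.0 + alpha) < theta < 1.0:
        raise ValueError("theta must lie in (1 / (1 + alpha), 1)")
    return min(alpha * theta / 2.0, 1.0 - theta)

def best_theta(alpha):
    """Value of theta at which alpha * theta / 2 equals 1 - theta."""
    return 2.0 / (alpha + 2.0)
```

For instance, `rate_exponent(best_theta(1.0), 1.0)` evaluates to 1/3 up to rounding, matching the rate stated for $\alpha = 1$.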

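Example 1. Let $\phi(t) = (1 - t)_+^q$ with $1 < q \le 2$ and assume that $g_{\mathcal{H}} = \arg\inf_{g\in\mathcal{H}_G} \mathcal{E}(g)$
exists. Let $\{g_t : t = 1, \ldots, T + 1\}$ be given by Algorithm 1 with step sizes $\gamma_t = ct^{-\theta}$ with some
$\theta \in (\frac{1}{q}, 1)$ and $c > 0$. Then,

\[ \mathbb{E}[\mathcal{E}(g_{T+1}) - \mathcal{E}(g_{\mathcal{H}})] = O\big(T^{-\min(\frac{(q-1)\theta}{2},\, 1-\theta)}\big). \]

Example 2. Let $\phi(t) = \log(1 + e^{-t})$ and assume that $g_{\mathcal{H}} = \arg\inf_{g\in\mathcal{H}_G} \mathcal{E}(g)$ exists. Let
$\{g_t : t = 1, \ldots, T + 1\}$ be given by Algorithm 1 with step sizes $\gamma_t = ct^{-\theta}$ with some $\theta \in (\frac{1}{2}, 1)$
and $c > 0$. Then,

\[ \mathbb{E}[\mathcal{E}(g_{T+1}) - \mathcal{E}(g_{\mathcal{H}})] = O\big(T^{-\min(\frac{\theta}{2},\, 1-\theta)}\big). \]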

Now we turn our attention to the convergence rates of Algorithm 2. 

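Theorem 3. Assume $\phi$ is 1-activating, and $f_{\mathcal{H}} = \arg\inf_{f\in\mathcal{H}_K} \mathcal{E}(f)$ exists. Let $\{f_t : t = 1, \ldots, T + 1\}$
be given by Algorithm 2 with step sizes $\gamma_t = ct^{-\theta}$ with some $\theta \in (\frac{1}{2}, 1)$ and
$0 < c \le \frac{1}{4L\kappa^2}$. Then, for any $\delta \in (0, \min(\frac{\theta}{2} - \frac{1}{4},\, 1 - \theta))$, there holds

\[ \mathbb{E}[\mathcal{E}(f_{T+1}) - \mathcal{E}(f_{\mathcal{H}})] \le C_{\theta,\delta,\mathcal{H}}\, T^{-\min(\frac{\theta}{2} - \frac{1}{4},\, 1-\theta) + \delta}, \]

where the constant $C_{\theta,\delta,\mathcal{H}}$ depends on $\theta, \delta$ and $\|f_{\mathcal{H}}\|_K$ (see its explicit form in the proof).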

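If, moreover, the gradient of $\phi$ is uniformly bounded then the rate in the above theorem can
be further improved.

Theorem 4. Under the same assumptions as in Theorem 3 and further assuming $|\phi'(s)| \le B < \infty$
for any $s \in \mathbb{R}$, then, for any $\delta \in (0, \min(\frac{\theta}{4},\, 1 - \theta))$, we have

\[ \mathbb{E}[\mathcal{E}(f_{T+1}) - \mathcal{E}(f_{\mathcal{H}})] \le \widetilde{C}_{\theta,\delta,\mathcal{H}}\, T^{-\min(\frac{\theta}{4},\, 1-\theta) + \delta}, \]

where the constant $\widetilde{C}_{\theta,\delta,\mathcal{H}}$ depends on $\theta, \delta$ and $\|f_{\mathcal{H}}\|_K$ (see its explicit form in the proof).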

From the above theorem, we see that the maximal rate for Algorithm 2 associated with an
$\alpha$-activating loss is arbitrarily close to $O(T^{-\frac{1}{6}})$. If, moreover, the gradient of the loss function
$\phi$ is uniformly bounded then the maximal rate is improved to $O(T^{-\frac{1}{5}})$. In particular, from
the above theorem, we can immediately get the following examples since $\phi(t) = (1 - t)_+^2$ and
$\phi(t) = \log(1 + e^{-t})$ are both 1-activating loss functions, and the gradient of $\phi(t) = \log(1 + e^{-t})$
is uniformly bounded by one.

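Example 3. Let $\phi(t) = (1 - t)_+^2$ and assume that $f_{\mathcal{H}} = \arg\inf_{f\in\mathcal{H}_K} \mathcal{E}(f)$
exists. Let $\{f_t : t = 1, \ldots, T + 1\}$ be given by Algorithm 2 with step sizes $\gamma_t = ct^{-\theta}$ with some
$\theta \in (\frac{1}{2}, 1)$ and $0 < c \le \frac{1}{4L\kappa^2}$. Then, for any $\delta \in (0, \min(\frac{\theta}{2} - \frac{1}{4},\, 1 - \theta))$, there holds

\[ \mathbb{E}[\mathcal{E}(f_{T+1}) - \mathcal{E}(f_{\mathcal{H}})] = O\big(T^{-\min(\frac{\theta}{2} - \frac{1}{4},\, 1-\theta) + \delta}\big). \]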
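Example 4. Let $\phi(t) = \log(1 + e^{-t})$ and assume that $f_{\mathcal{H}} = \arg\inf_{f\in\mathcal{H}_K} \mathcal{E}(f)$ exists. Let
$\{f_t : t = 1, \ldots, T + 1\}$ be given by Algorithm 2 with step sizes $\gamma_t = ct^{-\theta}$ with some $\theta \in (\frac{1}{2}, 1)$
and $0 < c \le \frac{1}{4L\kappa^2}$. Then, for any $\delta \in (0, \min(\frac{\theta}{4},\, 1 - \theta))$,

\[ \mathbb{E}[\mathcal{E}(f_{T+1}) - \mathcal{E}(f_{\mathcal{H}})] = O\big(T^{-\min(\frac{\theta}{4},\, 1-\theta) + \delta}\big). \]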


3 Proofs of Main Results 

We derive some useful properties of the $\alpha$-activating loss function $\phi$, which play critical roles
in proving the main theorems. Some of them may be of interest in their own right.

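Proposition 1. Assume that $\phi : \mathbb{R} \to \mathbb{R}$ is convex and its gradient is $\alpha$-Hölder continuous,
i.e. $L := \sup_{s \ne \tilde{s}} |\phi'(s) - \phi'(\tilde{s})| / |s - \tilde{s}|^{\alpha} < \infty$. Then, for any $s, \tilde{s} \in \mathbb{R}$, the following properties
hold true.

(a) $\phi(s) - \phi(\tilde{s}) - \phi'(\tilde{s})(s - \tilde{s}) \le \frac{L}{1+\alpha} |s - \tilde{s}|^{1+\alpha}$.

(b) $\phi(\tilde{s}) \ge \phi(s) + \phi'(s)(\tilde{s} - s) + \frac{\alpha}{(1+\alpha) L^{1/\alpha}} |\phi'(s) - \phi'(\tilde{s})|^{\frac{1+\alpha}{\alpha}}$.

(c) $(\phi'(s) - \phi'(\tilde{s}))(s - \tilde{s}) \ge \frac{2\alpha}{(1+\alpha) L^{1/\alpha}} |\phi'(s) - \phi'(\tilde{s})|^{\frac{1+\alpha}{\alpha}}$.

(d) If, moreover, $\phi(s) \ge 0$ for any $s \in \mathbb{R}$, then $|\phi'(s)|^{\frac{1+\alpha}{\alpha}} \le \frac{(1+\alpha)^{\frac{1+\alpha}{\alpha}} L^{1/\alpha}}{\alpha}\, \phi(s)$.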

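Proof. Part (a) directly follows from the assumption that $|\phi'(s) - \phi'(\tilde{s})| \le L|s - \tilde{s}|^{\alpha}$
and the fact

\[ \phi(s) - \phi(\tilde{s}) - \phi'(\tilde{s})(s - \tilde{s}) = \int_0^1 \big(\phi'(\theta s + (1 - \theta)\tilde{s}) - \phi'(\tilde{s})\big)(s - \tilde{s})\, d\theta. \]

For part (b), let $\psi_s(\tilde{s}) = \phi(\tilde{s}) - \phi'(s)\tilde{s}$. Notice that $\psi_s(\cdot)$ is convex, differentiable and its
gradient $\psi_s'(\tilde{s}) = \phi'(\tilde{s}) - \phi'(s)$ is $\alpha$-Hölder continuous. In addition, $\psi_s(\cdot)$ achieves the minimum
at $s$ since $\psi_s'(s) = 0$. Hence, for $\delta = L^{-1/\alpha}$,

\[
\begin{aligned}
\psi_s(s) &\le \psi_s\big(\tilde{s} - \delta(\phi'(\tilde{s}) - \phi'(s))|\phi'(\tilde{s}) - \phi'(s)|^{\frac{1-\alpha}{\alpha}}\big) \\
&\le \psi_s(\tilde{s}) + \psi_s'(\tilde{s})\big(-\delta(\phi'(\tilde{s}) - \phi'(s))|\phi'(\tilde{s}) - \phi'(s)|^{\frac{1-\alpha}{\alpha}}\big)
 + \frac{L}{1+\alpha}\big|\delta(\phi'(\tilde{s}) - \phi'(s))|\phi'(\tilde{s}) - \phi'(s)|^{\frac{1-\alpha}{\alpha}}\big|^{1+\alpha} \\
&= \psi_s(\tilde{s}) - \frac{\alpha}{(1+\alpha)L^{1/\alpha}} |\phi'(\tilde{s}) - \phi'(s)|^{\frac{1+\alpha}{\alpha}},
\end{aligned}
\]

where the second inequality used the fact that $\psi_s(\cdot)$ satisfies part (a). By the definition
of $\psi_s(\cdot)$, re-arranging the terms in the above estimation yields the desired result of part (b).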

For part (c), switching the roles of $s, \tilde{s}$ in part (b) yields that

\[ \phi(s) \ge \phi(\tilde{s}) + \phi'(\tilde{s})(s - \tilde{s}) + \frac{\alpha}{(1+\alpha)L^{1/\alpha}} |\phi'(s) - \phi'(\tilde{s})|^{\frac{1+\alpha}{\alpha}}. \]

Adding part (b) and the above inequality implies part (c).


For part (d), the case for $\alpha = 1$ was proved in [52]. We generalize their proof to the general
case $0 < \alpha \le 1$. Indeed, we only need to prove the case $\phi'(s) \ne 0$. For any $s \in \mathbb{R}$, let
$r = s - ((1 + \alpha)L)^{-\frac{1}{\alpha}}\, \phi'(s)\,|\phi'(s)|^{\frac{1-\alpha}{\alpha}}$. By the mean-value theorem, there exists $\xi$ in the range
$(s, r)$ (if $\phi'(s) < 0$) or $(r, s)$ (if $\phi'(s) > 0$) such that $\phi(r) = \phi(s) + \phi'(\xi)(r - s)$. Hence,

\[
\begin{aligned}
0 \le \phi(r) &= \phi(s) + \phi'(s)(r - s) + (\phi'(\xi) - \phi'(s))(r - s) \\
&\le \phi(s) + \phi'(s)(r - s) + L|r - s|\,|\xi - s|^{\alpha} \\
&\le \phi(s) + \phi'(s)(r - s) + L|r - s|^{1+\alpha} = \phi(s) - \frac{\alpha}{(1+\alpha)^{\frac{1+\alpha}{\alpha}} L^{1/\alpha}} |\phi'(s)|^{\frac{1+\alpha}{\alpha}},
\end{aligned}
\]

which completes the proof of part (d). □
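Parts (a), (c) and (d) can be sanity-checked numerically for a concrete loss. The sketch below does this for the logistic loss, which is 1-activating with $L = 1/4$ (its second derivative is at most 1/4), so the constants become $\frac{L}{1+\alpha} = \frac{1}{8}$ in (a), $\frac{2\alpha}{(1+\alpha)L^{1/\alpha}} = 4$ in (c) and $\frac{(1+\alpha)^{(1+\alpha)/\alpha}L^{1/\alpha}}{\alpha} = 1$ in (d). The grid and tolerance are illustrative assumptions; a finite check is of course no substitute for the proof.

```python
import math

ALPHA = 1.0   # logistic loss: Lipschitz gradient
L = 0.25      # sup of the second derivative of log(1 + exp(-s))

def phi(s):
    """Logistic loss log(1 + exp(-s)), written to avoid overflow."""
    if s >= 0:
        return math.log1p(math.exp(-s))
    return -s + math.log1p(math.exp(s))

def dphi(s):
    """Derivative of the logistic loss."""
    if s >= 0:
        e = math.exp(-s)
        return -e / (1.0 + e)
    return -1.0 / (1.0 + math.exp(s))

def check_proposition_1(grid, tol=1e-12):
    """Return True if parts (a), (c), (d) hold at all grid points (up to tol)."""
    power = (1.0 + ALPHA) / ALPHA
    const_a = L / (1.0 + ALPHA)
    const_c = 2.0 * ALPHA / ((1.0 + ALPHA) * L ** (1.0 / ALPHA))
    const_d = (1.0 + ALPHA) ** power * L ** (1.0 / ALPHA) / ALPHA
    for s in grid:
        if abs(dphi(s)) ** power > const_d * phi(s) + tol:          # part (d)
            return False
        for u in grid:
            gap = abs(dphi(s) - dphi(u))
            if phi(s) - phi(u) - dphi(u) * (s - u) > const_a * abs(s - u) ** (1.0 + ALPHA) + tol:
                return False                                         # part (a)
            if (dphi(s) - dphi(u)) * (s - u) + tol < const_c * gap ** power:
                return False                                         # part (c)
    return True

grid = [i / 4.0 for i in range(-40, 41)]
```

Calling `check_proposition_1(grid)` on this grid is expected to return `True`.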


We end this section with a comment on deriving the 


3.1 Proofs for the Convergence of Algorithm 1 


The main idea for proving the convergence of Algorithm 1 is to derive a recursive inequality
for the sequence $\{R_t := \mathbb{E}[\mathcal{E}(g_t) - \mathcal{E}(g_{\mathcal{H}})] : 1 \le t \le T+1\}$ (i.e. the relationship between $R_{t+1}$
and $R_t$), and then apply induction on this inequality. To this end, we need to establish the
boundedness of the learning sequence $\{g_t : t = 1, 2, \ldots, T + 1\}$ generated by Algorithm 1.
Throughout the paper, we use the convention that $\sum_{j=k}^{t} \gamma_j^{1+\alpha} = 0$ whenever $t < k$.

Denote $\kappa = \sup_{x\in\mathcal{X}} \sqrt{G(x, x)}$.

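Lemma 1. Let $\{g_t : t = 1, \ldots, T + 1\}$ be generated by Algorithm 1. Then,

\[ \mathbb{E}[\mathcal{E}(g_{t+1})] \le (1 + \mathcal{E}(g_1)) \exp\Big(A_\alpha \sum_{j=1}^{t} \gamma_j^{1+\alpha}\Big), \]

where $A_\alpha = L^2 (1 + \frac{1}{\alpha})^{\alpha} \kappa^{2(1+\alpha)}$.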


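Proof. Since $\phi$ is convex and $\phi'$ is $\alpha$-Hölder continuous, by part (a) and part (d) of
Proposition 1, we have

\[
\begin{aligned}
\phi(y g_{t+1}(x)) &\le \phi(y g_t(x)) + \phi'(y g_t(x))\, y\,(g_{t+1}(x) - g_t(x)) + \frac{L}{1+\alpha} |g_{t+1}(x) - g_t(x)|^{1+\alpha} \\
&= \phi(y g_t(x)) - \gamma_t \langle \phi'(y g_t(x))\, y G_x,\ \phi'(y_t g_t(x_t))\, y_t G_{x_t} \rangle_G + \frac{L}{1+\alpha} |g_{t+1}(x) - g_t(x)|^{1+\alpha} \\
&\le \phi(y g_t(x)) - \gamma_t \langle \phi'(y g_t(x))\, y G_x,\ \phi'(y_t g_t(x_t))\, y_t G_{x_t} \rangle_G + \frac{L \kappa^{2(1+\alpha)} \gamma_t^{1+\alpha}}{1+\alpha} |\phi'(y_t g_t(x_t))|^{1+\alpha} \\
&\le \phi(y g_t(x)) - \gamma_t \langle \phi'(y g_t(x))\, y G_x,\ \phi'(y_t g_t(x_t))\, y_t G_{x_t} \rangle_G + A_\alpha \gamma_t^{1+\alpha} |\phi(y_t g_t(x_t))|^{\alpha}.
\end{aligned}
\]

Taking expectation of both sides of the above inequality with respect to $z = (x, y)$ and
samples $\{z_1, \ldots, z_t\}$, and noting that $g_t$ only depends on $\{z_1, \ldots, z_{t-1}\}$, we have

\[
\begin{aligned}
\mathbb{E}[\mathcal{E}(g_{t+1})] &\le \mathbb{E}[\mathcal{E}(g_t)] - \gamma_t \mathbb{E}\Big\| \int_{\mathcal{Z}} \phi'(y g_t(x))\, y G_x\, d\rho(x,y) \Big\|_G^2 + A_\alpha \gamma_t^{1+\alpha}\, \mathbb{E}\Big[\int_{\mathcal{Z}} |\phi(y g_t(x))|^{\alpha}\, d\rho(x,y)\Big] \\
&\le \mathbb{E}[\mathcal{E}(g_t)] - \gamma_t \mathbb{E}\Big\| \int_{\mathcal{Z}} \phi'(y g_t(x))\, y G_x\, d\rho(x,y) \Big\|_G^2 + A_\alpha \gamma_t^{1+\alpha} \Big(\mathbb{E}\Big[\int_{\mathcal{Z}} \phi(y g_t(x))\, d\rho(x,y)\Big]\Big)^{\alpha} \\
&= \mathbb{E}[\mathcal{E}(g_t)] - \gamma_t \mathbb{E}\Big\| \int_{\mathcal{Z}} \phi'(y g_t(x))\, y G_x\, d\rho(x,y) \Big\|_G^2 + A_\alpha \gamma_t^{1+\alpha} \big(\mathbb{E}[\mathcal{E}(g_t)]\big)^{\alpha} \\
&\le (1 + A_\alpha \gamma_t^{1+\alpha})\, \mathbb{E}[\mathcal{E}(g_t)] - \gamma_t \mathbb{E}\Big\| \int_{\mathcal{Z}} \phi'(y g_t(x))\, y G_x\, d\rho(x,y) \Big\|_G^2 + A_\alpha \gamma_t^{1+\alpha}.
\end{aligned}
\tag{3.1}
\]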
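Consequently,

\[ \mathbb{E}[\mathcal{E}(g_{t+1})] \le (1 + A_\alpha \gamma_t^{1+\alpha})\, \mathbb{E}[\mathcal{E}(g_t)] + A_\alpha \gamma_t^{1+\alpha}. \]

The above inequality implies that

\[
\begin{aligned}
\mathbb{E}[\mathcal{E}(g_{t+1})] &\le \prod_{j=1}^{t} (1 + A_\alpha \gamma_j^{1+\alpha})\, \mathcal{E}(g_1) + A_\alpha \sum_{j=1}^{t} \gamma_j^{1+\alpha} \prod_{k=j+1}^{t} (1 + A_\alpha \gamma_k^{1+\alpha}) \\
&= \prod_{j=1}^{t} (1 + A_\alpha \gamma_j^{1+\alpha})\, \mathcal{E}(g_1) + \sum_{j=1}^{t} \Big[ \prod_{k=j}^{t} (1 + A_\alpha \gamma_k^{1+\alpha}) - \prod_{k=j+1}^{t} (1 + A_\alpha \gamma_k^{1+\alpha}) \Big] \\
&= \prod_{j=1}^{t} (1 + A_\alpha \gamma_j^{1+\alpha})\, \mathcal{E}(g_1) + \Big[ \prod_{k=1}^{t} (1 + A_\alpha \gamma_k^{1+\alpha}) - 1 \Big] \\
&\le (1 + \mathcal{E}(g_1)) \exp\Big(A_\alpha \sum_{j=1}^{t} \gamma_j^{1+\alpha}\Big).
\end{aligned}
\]

This completes the proof of the lemma. □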


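From the above lemma, we know that if $\sum_{t=1}^{\infty} \gamma_t^{1+\alpha} < \infty$ then, for any $t \in \mathbb{N}$, there holds

\[
\mathbb{E}[\mathcal{E}(g_{t+1})] \le (1 + \mathcal{E}(g_1)) \exp\Big(A_\alpha \sum_{j=1}^{t} \gamma_j^{1+\alpha}\Big)
\le D_\infty := (1 + \mathcal{E}(g_1)) \exp\Big(A_\alpha \sum_{j=1}^{\infty} \gamma_j^{1+\alpha}\Big) < \infty.
\tag{3.2}
\]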


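One typical example of step sizes is of the form $\gamma_t = \frac{c}{t^{\theta}}$ with some $\theta \in (\frac{1}{1+\alpha}, 1)$. In this case,
notice that

\[
\begin{aligned}
\sum_{j=1}^{t} \gamma_j^{1+\alpha} &= c^{1+\alpha} \sum_{j=1}^{t} j^{-\theta(1+\alpha)} = c^{1+\alpha} \Big(1 + \sum_{j=2}^{t} j^{-\theta(1+\alpha)}\Big) \\
&\le c^{1+\alpha} \Big(1 + \int_1^{\infty} s^{-\theta(1+\alpha)}\, ds\Big) = \frac{c^{1+\alpha}\, \theta(1+\alpha)}{\theta(1+\alpha) - 1} \le \frac{2 c^{1+\alpha}}{\theta(1+\alpha) - 1}.
\end{aligned}
\tag{3.3}
\]

Hence, for any $t \in \mathbb{N}$,

\[ \mathbb{E}[\mathcal{E}(g_t)] \le D_\infty \le (1 + \mathcal{E}(g_1)) \exp\Big(\frac{2 A_\alpha c^{1+\alpha}}{\theta(1+\alpha) - 1}\Big). \tag{3.4} \]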
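The bound on $\sum_j \gamma_j^{1+\alpha}$ for polynomially decaying step sizes $\gamma_t = c t^{-\theta}$ is easy to check numerically. The snippet below is a small sketch with illustrative parameter values; it compares a long partial sum with the closed-form bound $\frac{2c^{1+\alpha}}{\theta(1+\alpha)-1}$.

```python
def step_size(t, c, theta):
    """Polynomially decaying step size gamma_t = c * t**(-theta)."""
    return c * t ** (-theta)

def partial_sum(T, c, theta, alpha):
    """sum_{j=1}^{T} gamma_j**(1 + alpha)."""
    return sum(step_size(j, c, theta) ** (1.0 + alpha) for j in range(1, T + 1))

def sum_bound(c, theta, alpha):
    """Closed-form bound 2 * c**(1 + alpha) / (theta * (1 + alpha) - 1)."""
    p = theta * (1.0 + alpha)
    if p <= 1.0:
        raise ValueError("need theta > 1 / (1 + alpha) for a finite sum")
    return 2.0 * c ** (1.0 + alpha) / (p - 1.0)
```

With $c = 1$, $\theta = 0.75$ and $\alpha = 1$ the partial sums stay below the bound 4; the series itself converges to a value near 2.61.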


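We now turn our attention to estimating the boundedness of $\mathbb{E}[\|g_t - g_{\mathcal{H}}\|_G^2]$.

Lemma 2. Assume that $g_{\mathcal{H}} = \arg\inf_{g\in\mathcal{H}_G} \mathcal{E}(g)$ exists and let the learning sequence $\{g_t : t = 1, \ldots, T + 1\}$
be generated by Algorithm 1. Then,

\[ \mathbb{E}[\|g_{t+1} - g_{\mathcal{H}}\|_G^2] \le \|g_{\mathcal{H}}\|_G^2 + B_\alpha D_\infty^{\frac{2\alpha}{1+\alpha}} \sum_{j=1}^{t} \gamma_j^2, \]

where $B_\alpha := \kappa^2 (1 + \alpha)^2 L^{\frac{2}{1+\alpha}} \alpha^{-\frac{2\alpha}{1+\alpha}}$.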


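Proof. Notice that, since $g_{\mathcal{H}} = \arg\inf_{g\in\mathcal{H}_G} \mathcal{E}(g)$,

\[ \int_{\mathcal{Z}} \phi'(y g_{\mathcal{H}}(x))\, y G_x\, d\rho(x,y) = 0. \]

By the definition of $g_{t+1}$ in Algorithm 1, $\mathbb{E}[\|g_{t+1} - g_{\mathcal{H}}\|_G^2]$ is therefore bounded by

\[
\begin{aligned}
&\mathbb{E}[\|g_t - g_{\mathcal{H}}\|_G^2] - 2\gamma_t \mathbb{E}[\langle \phi'(y_t g_t(x_t))\, y_t G_{x_t},\ g_t - g_{\mathcal{H}} \rangle_G] + \gamma_t^2\, \mathbb{E}[\|\phi'(y_t g_t(x_t)) G_{x_t}\|_G^2] \\
&\le \mathbb{E}[\|g_t - g_{\mathcal{H}}\|_G^2] - 2\gamma_t \mathbb{E}[\langle \phi'(y_t g_t(x_t))\, y_t G_{x_t},\ g_t - g_{\mathcal{H}} \rangle_G] + \gamma_t^2 \kappa^2\, \mathbb{E}[|\phi'(y_t g_t(x_t))|^2] \\
&= \mathbb{E}[\|g_t - g_{\mathcal{H}}\|_G^2] - 2\gamma_t \mathbb{E}\Big[\Big\langle \int_{\mathcal{Z}} [\phi'(y g_t(x))\, y G_x - \phi'(y g_{\mathcal{H}}(x))\, y G_x]\, d\rho(x,y),\ g_t - g_{\mathcal{H}} \Big\rangle_G\Big]
 + \gamma_t^2 \kappa^2\, \mathbb{E}[|\phi'(y_t g_t(x_t))|^2] \\
&\le \mathbb{E}[\|g_t - g_{\mathcal{H}}\|_G^2] + \gamma_t^2 \kappa^2\, \mathbb{E}[|\phi'(y_t g_t(x_t))|^2] \\
&\le \mathbb{E}[\|g_t - g_{\mathcal{H}}\|_G^2] + \gamma_t^2 \kappa^2 \big(\mathbb{E}[|\phi'(y_t g_t(x_t))|^{\frac{1+\alpha}{\alpha}}]\big)^{\frac{2\alpha}{1+\alpha}},
\end{aligned}
\tag{3.5}
\]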
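where the second to last inequality used the fact, by part (c) of Proposition 1,

\[
\Big\langle \int_{\mathcal{Z}} [\phi'(y g_t(x))\, y G_x - \phi'(y g_{\mathcal{H}}(x))\, y G_x]\, d\rho(x,y),\ g_t - g_{\mathcal{H}} \Big\rangle_G
= \int_{\mathcal{Z}} [\phi'(y g_t(x)) - \phi'(y g_{\mathcal{H}}(x))]\, y\,(g_t(x) - g_{\mathcal{H}}(x))\, d\rho(x,y) \ge 0.
\]

Also, by part (d) of Proposition 1, we have $|\phi'(y_t g_t(x_t))|^{\frac{1+\alpha}{\alpha}} \le \frac{(1+\alpha)^{\frac{1+\alpha}{\alpha}} L^{1/\alpha}}{\alpha}\, \phi(y_t g_t(x_t))$.
Putting this back into (3.5), we know from (3.2) that

\[
\begin{aligned}
\mathbb{E}[\|g_{t+1} - g_{\mathcal{H}}\|_G^2] &\le \mathbb{E}[\|g_t - g_{\mathcal{H}}\|_G^2] + \gamma_t^2 \kappa^2 (1+\alpha)^2 L^{\frac{2}{1+\alpha}} \alpha^{-\frac{2\alpha}{1+\alpha}} \big[\mathbb{E}(\mathcal{E}(g_t))\big]^{\frac{2\alpha}{1+\alpha}} \\
&\le \mathbb{E}[\|g_t - g_{\mathcal{H}}\|_G^2] + \gamma_t^2 B_\alpha D_\infty^{\frac{2\alpha}{1+\alpha}},
\end{aligned}
\]

which directly yields the desired result since $g_1 = 0$. This completes the proof of the lemma. □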


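Let $\widetilde{D}_\infty = \|g_{\mathcal{H}}\|_G^2 + B_\alpha D_\infty^{\frac{2\alpha}{1+\alpha}} \sum_{j=1}^{\infty} \gamma_j^2$. Then, if the step sizes are in the form of $\gamma_t = ct^{-\theta}$ with
$\theta \in (\frac{1}{1+\alpha}, 1)$, then, arguing as in (3.3),

\[
\mathbb{E}[\|g_t - g_{\mathcal{H}}\|_G^2] \le \|g_{\mathcal{H}}\|_G^2 + c^2 B_\alpha D_\infty^{\frac{2\alpha}{1+\alpha}} \sum_{j=1}^{t-1} j^{-2\theta}
\le \widetilde{D}_\infty \le \|g_{\mathcal{H}}\|_G^2 + \frac{2\theta c^2 B_\alpha D_\infty^{\frac{2\alpha}{1+\alpha}}}{2\theta - 1}.
\tag{3.6}
\]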


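We are now in a position to prove the main theorems for Algorithm 1.

Proof of Theorem 1. By (3.1) and (3.2), we have

\[
\mathbb{E}[\mathcal{E}(g_{t+1})] \le \mathbb{E}[\mathcal{E}(g_t)] - \gamma_t \mathbb{E}\Big\| \int_{\mathcal{Z}} \phi'(y g_t(x))\, y G_x\, d\rho(x,y) \Big\|_G^2 + A_\alpha (1 + D_\infty) \gamma_t^{1+\alpha}.
\tag{3.7}
\]

The above inequality implies that

\[ \mathbb{E}[\mathcal{E}(g_{t+1})] \le \mathbb{E}[\mathcal{E}(g_t)] + A_\alpha (1 + D_\infty) \gamma_t^{1+\alpha}. \]

Consequently, for any fixed $t \le T$,

\[ \mathbb{E}[\mathcal{E}(g_{T+1})] \le \mathbb{E}[\mathcal{E}(g_t)] + A_\alpha (1 + D_\infty) \sum_{j=t}^{T} \gamma_j^{1+\alpha}. \]

This means that $\limsup_{T\to\infty} \mathbb{E}[\mathcal{E}(g_{T+1})] \le \mathbb{E}[\mathcal{E}(g_t)] + A_\alpha (1 + D_\infty) \sum_{j=t}^{\infty} \gamma_j^{1+\alpha}$, which also implies,
since $\sum_{j=1}^{\infty} \gamma_j^{1+\alpha} < \infty$, that

\[
\limsup_{T\to\infty} \mathbb{E}[\mathcal{E}(g_{T+1})] \le \liminf_{t\to\infty} \Big\{ \mathbb{E}[\mathcal{E}(g_t)] + A_\alpha (1 + D_\infty) \sum_{j=t}^{\infty} \gamma_j^{1+\alpha} \Big\} = \liminf_{t\to\infty} \mathbb{E}[\mathcal{E}(g_t)].
\]

Hence, $\varepsilon := \lim_{t\to\infty} \mathbb{E}[\mathcal{E}(g_t)]$ exists and, apparently, $\inf_{g\in\mathcal{H}_G} \mathcal{E}(g) \le \varepsilon \le D_\infty < \infty$ where the
last inequality follows from equation (3.2). This completes the proof for the first part of the
theorem.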

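Now it remains to prove, if we further assume that $g_{\mathcal{H}} = \arg\inf_{g\in\mathcal{H}_G} \mathcal{E}(g)$ exists and
$\sum_{j=1}^{\infty} \gamma_j = \infty$, that $\varepsilon = \inf_{g\in\mathcal{H}_G} \mathcal{E}(g)$. Let us assume, on the contrary, that
$\varepsilon_1 = \varepsilon - \inf_{g\in\mathcal{H}_G} \mathcal{E}(g) > 0$. Let $R_t := \mathbb{E}[\mathcal{E}(g_t)] - \inf_{g\in\mathcal{H}_G} \mathcal{E}(g)$ for any $t \in \mathbb{N}$. In this case, there exists
$t_1$ such that, for any $t \ge t_1$, $R_t \ge \frac{\varepsilon_1}{2}$. However, from (3.7), we know that

\[
R_{t+1} \le R_t - \gamma_t \mathbb{E}\Big\| \int_{\mathcal{Z}} \phi'(y g_t(x))\, y G_x\, d\rho(x,y) \Big\|_G^2 + A_\alpha (1 + D_\infty) \gamma_t^{1+\alpha}.
\tag{3.8}
\]

By the convexity of $\phi$,

\[
\begin{aligned}
\mathcal{E}(g_t) - \mathcal{E}(g_{\mathcal{H}}) &\le \int_{\mathcal{Z}} \phi'(y g_t(x))\, y\,(g_t(x) - g_{\mathcal{H}}(x))\, d\rho(x,y) \\
&= \Big\langle \int_{\mathcal{Z}} \phi'(y g_t(x))\, y G_x\, d\rho(x,y),\ g_t - g_{\mathcal{H}} \Big\rangle_G \\
&\le \Big[ \Big\| \int_{\mathcal{Z}} \phi'(y g_t(x))\, y G_x\, d\rho(x,y) \Big\|_G^2 \Big]^{\frac{1}{2}} \|g_t - g_{\mathcal{H}}\|_G.
\end{aligned}
\]

Also, observe that $\widetilde{D}_\infty = \|g_{\mathcal{H}}\|_G^2 + B_\alpha D_\infty^{\frac{2\alpha}{1+\alpha}} \sum_{j=1}^{\infty} \gamma_j^2 < \infty$, since $\sum_{j=1}^{\infty} \gamma_j^{1+\alpha} < \infty$ and $\alpha \le 1$.
This implies that

\[
\mathbb{E}\Big\| \int_{\mathcal{Z}} \phi'(y g_t(x))\, y G_x\, d\rho(x,y) \Big\|_G^2 \ge \frac{R_t^2}{\mathbb{E}[\|g_t - g_{\mathcal{H}}\|_G^2]} \ge \frac{R_t^2}{\widetilde{D}_\infty}.
\]

Putting this back into (3.8) yields that

\[ R_{t+1} \le R_t - \gamma_t R_t^2 / \widetilde{D}_\infty + A_\alpha (1 + D_\infty) \gamma_t^{1+\alpha}. \tag{3.9} \]

This means that

\[
\lim_{T\to\infty} \sum_{t=1}^{T} \gamma_t R_t^2 / \widetilde{D}_\infty \le R_1 + A_\alpha (1 + D_\infty) \sum_{t=1}^{\infty} \gamma_t^{1+\alpha}
\le \mathcal{E}(g_1) + A_\alpha (1 + D_\infty) \sum_{t=1}^{\infty} \gamma_t^{1+\alpha} < \infty.
\]

However, $\sum_{t=1}^{T} \gamma_t R_t^2 / \widetilde{D}_\infty \ge \frac{\varepsilon_1^2}{4 \widetilde{D}_\infty} \sum_{t=t_1}^{T} \gamma_t$, which implies, by the assumption that $\sum_{t=1}^{\infty} \gamma_t = \infty$,
that

\[ \lim_{T\to\infty} \sum_{t=1}^{T} \gamma_t R_t^2 / \widetilde{D}_\infty \ge \frac{\varepsilon_1^2}{4 \widetilde{D}_\infty} \sum_{t=t_1}^{\infty} \gamma_t = \infty. \]

This leads to a contradiction. Hence, $\varepsilon_1 = \lim_{t\to\infty} R_t = 0$. This completes the proof of the
theorem. □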


We now turn our attention to proving Theorem 2 by an induction based on the recursive
inequality (3.9).

Proof of Theorem 2. We prove the theorem from the recursive inequality (3.9). Since
$\gamma_t = \frac{c}{t^{\theta}}$ with some $\theta \in (\frac{1}{1+\alpha}, 1)$, inequalities (3.4) and (3.6) hold true. Let $\beta = \min(\frac{\alpha\theta}{2}, 1 - \theta)$,
and choose


D = max< D c 


r _2c_vmin(f,i^) 


(2 /? ^oo) 


minp+f) ,|) D c 


+ 


A q (1 + Doo)D 0 


Denote 


to — 


,2cD 1 
2(_)^ 
-L^OO 


By the definition of D and /d, we know that D > anc l 0 < 6 + fd < 1 which further 
implies that to > 4. Since 


D > max 


{Axu ( 


2 c 

DZ 


) mln ( 2 . 1 e e ) (2 0D O 


, min(l+ 2 . 1 )' 
we have


E[£(g t ) - £{gn)\ < Doo < — < 73, Vi < i 0 . 

t o 

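Now we assume that $R_t \le \frac{D}{t^{\beta}}$ for some $t \in \mathbb{N}$ with $t \ge t_0$, and we are going to prove that
$R_{t+1} \le \frac{D}{(t+1)^{\beta}}$ by induction.

To this end, let $F(x) := x - \gamma_t x^2 / \widetilde{D}_\infty$, where $\widetilde{D}_\infty$ is the bound on $\mathbb{E}[\|g_t - g_{\mathcal{H}}\|_G^2]$ in (3.6), and notice
that $F$ is increasing when $x \in (0, \frac{\widetilde{D}_\infty}{2\gamma_t}]$.
Observe that $t \ge t_0 \ge (\frac{2cD}{\widetilde{D}_\infty})^{\frac{1}{\theta+\beta}}$, which implies that $\frac{D}{t^{\beta}} \in (0, \frac{\widetilde{D}_\infty}{2\gamma_t}]$. Combining this with
(3.9) and the induction assumption $R_t \le \frac{D}{t^{\beta}}$ (i.e. $R_t \in (0, \frac{\widetilde{D}_\infty}{2\gamma_t}]$), we have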

< F(i2f) + A a (l + Doo) 7 t 1+ “ < ) + 4,(1 + 1 ) 00 ) 7 , 


l+a 


< 


D 


- w 


< 


D 


- W 


1 - (-£2- - 

VDoo 

1 - (£2. - 

^00 


D 

Aa(l+D c 

D 


>l t 2/3-ea\ t -0-0 

l) t -o-P 


(3.10) 


where the last inequality used that fact 2/3 — 9a < 0. By the definition of D, D > 2s&. _|_ 

\J a^+Doo)d Z which implies t h at jD- _ A a (i+D x ) > L Putting this back into (13T01) yields 
that 


R t+i < ~p 


1 -t 


- 9-0 


D 

~ ¥ 


1 -t 


-1 


D ( t ~ 1 \ < = D 
t py t ’ ~ tP^t +(f +1)0’ 


where the second inequality used the fact that $\theta + \beta \le 1$. This completes the proof of the
theorem. □


3.2 Proofs for the Convergence of Algorithm 2 


In this subsection, we prove the main theorems related to Algorithm 2. The main idea is to
derive a recursive inequality on the sequence $\{R_t := \mathbb{E}[\mathcal{E}(f_t) - \mathcal{E}(f_{\mathcal{H}})] : 1 \le t \le T + 1\}$ (i.e.
the relationship between $R_{t+1}$ and $R_t$), and then apply induction on this inequality.
To do this, let us establish some useful lemmas. Denote $\kappa = \sup_{x, x' \in \mathcal{X}} \sqrt{K((x, x'), (x, x'))}$.

Lemma 3. Assume $\phi$ is 1-activating and $f_{\mathcal{H}} = \arg\inf_{f\in\mathcal{H}_K} \mathcal{E}(f)$ exists. Let $\{f_t : t = 1, \ldots, T + 1\}$
be generated by Algorithm 2. Then


E [ll/m -/will'] < IWIhWk + ^u¥ + lnt )] exp((l + 32K 4 L 2 )^ 7j 2 ), 


3= 2 


where a\ = J J \\(f>'{{y - y)fu{x,x))K^\\ 2 K dp{x,y)dp{x,y). 


Proof. E[||/i+i - fnW k\ is bounded by 

t -1 

E tll ft - fn\\ 2 K\ + (t=i)^ E [|- yj)ft{x t ,Xj)){yt - y^^W,)|| 2 ] 

3 = 1 

t-1 

~ yj)ft( x ^ x 3))(yt - yj)(ft(x t ,xj ) - / h (^,^))] ■ 

J=1 


(3.11) 
Noting that $\int_{\mathcal{Z}} \int_{\mathcal{Z}} \phi'((y - y') f_{\mathcal{H}}(x, x'))\,(y - y')\, K_{(x, x')}\, d\rho(x, y)\, d\rho(x', y') = 0$, we have

- E Ej=i - yj)ft(xt,xj))(y t - yj){ft(x t ,xj) - f H (x t ,xj))\ 

t -1 


= ~ yj)ft(x t ,Xj )) - 0'((y* - yj)fH(xt,Xj))\(y t - yj)(ft(x t ,Xj ) - / w (x t ,Xj))] 

i=i 

t-i 

~ yj)fn(x t ,Xj))(yt - y j )(f t (xt,x j ) - fu{x t , x.,-))] 
i=i 

t-i 

< </>'((yt - yj)fn(x t ,Xj))(y t - yj){fn{x t ,Xj ) - /t(x t ,x.j))] 
i=i 

t-i 

= E [(J^((j/ - Vj)fn(x,Xj))(y - yj)K {XtXj) ,f u - f t ) K \ 

i=i 9 

< v*- l(E[||/t - /wlllD^w < - 1 )7t E [||/i - /will] + ^]- 

Also, Efll^" 1 ! - yj)ft(xt,Xj))(yt - yj)K^ Xj) \\ 2 ] can be bounded by 

t -1 

2 E[||^(0'((y t - yj)ft{x t , xj)) - 4>'{{y t - yj)fn(xt,Xj))){y t - yj)K (a 


(x t ,x. 


3 =1 
f —1 


+ 2 E[||X>'((yt - yj)f n (xt,Xj))(y t - yj)K^ uXj) 

< 32 k 4 L 2 (t - 1) 2 ||/ t - jw||| + 2 </ ~ 1 ) (7 w- 

Putting these two estimates into (13.111) . we have 

E Pl/t+i - /will] < (1 + (32 w 4 A 2 + l) 74 2 )E[||/ t - /«|||] + (2 ^+ 1 i ) 4 . 

Therefore, 

E[ll/<+i - /will] < nW 1 + (32« 4 r 2 + ih|)ll/wlll 

+7, £5=2 nu,+i(i + (32«‘r 2 + 1 )t 2 )[ 2 tJ + ft] 

< exp((32x 4 L 2 + l)£‘ =2 7, 2 )ll/wlll 

+AlES= ! [nL i ( 1 + (32S-V 2 + 1)7-2) - nU+l(l + (32SV 2 + 1)7-2) 
+<7 £5.2 nU+iU + (32s 4 t 2 + ibSjW 

< exp((l + 32w 4 L 2 )^‘ = 2 7 2 )|ll/«lll + p|({ 4 + In*)] ■ 

This completes the proof of the lemma. 


□ 


From the above lemma, we know if 7 ^ = A with some 0 € (/, 1). Then, 

E[ll ft - /will] < Et := exp( (1+ 2 3 ^ 2) ^ )[||/^||| + 4(4 + Ini)] (3.1 2 ) 
The next lemma estimates the boundedness of the learning sequence under the RKHS norm. 
Lemma 4. Let $\phi$ be 1-activating and $\{f_t : t = 1, \ldots, T + 1\}$ be given by Algorithm 2. If
$\gamma_t \kappa^2 \le \frac{1}{4L}$ for any $t \in \mathbb{N}$ then

\[ \|f_{t+1}\|_K \le D_t := C_\phi \sqrt{\sum_{j=2}^{t} \gamma_j}, \]

where $C_\phi = \sqrt{L}\, s_0$ if there exists $s_0 \in \mathbb{R}$ such that $\phi'(s_0) = 0$, and $C_\phi = \sqrt{2\phi(0) + \frac{2(\phi'(0))^2}{L}}$
otherwise.


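Proof. Write

\[
\begin{aligned}
\|f_{t+1}\|_K^2 &= \|f_t\|_K^2 + \frac{\gamma_t^2}{(t-1)^2} \Big\| \sum_{j=1}^{t-1} \phi'((y_t - y_j) f_t(x_t, x_j))\,(y_t - y_j)\, K_{(x_t, x_j)} \Big\|_K^2 \\
&\quad - \frac{2\gamma_t}{t-1} \sum_{j=1}^{t-1} \phi'((y_t - y_j) f_t(x_t, x_j))\,(y_t - y_j)\, f_t(x_t, x_j) \\
&\le \|f_t\|_K^2 + \frac{\gamma_t}{t-1} \sum_{j=1}^{t-1} \Big[ 4\gamma_t \kappa^2 |\phi'((y_t - y_j) f_t(x_t, x_j))|^2 - 2\phi'((y_t - y_j) f_t(x_t, x_j))\,(y_t - y_j)\, f_t(x_t, x_j) \Big] \\
&\le \|f_t\|_K^2 + \gamma_t \sup_{s\in\mathbb{R}} \big[ 4(\phi'(s))^2 \gamma_t \kappa^2 - 2\phi'(s) s \big].
\end{aligned}
\]

Therefore, the desired result follows directly from the following claim:

\[ \sup_{s\in\mathbb{R}} \big[ 4(\phi'(s))^2 \gamma_t \kappa^2 - 2\phi'(s) s \big] \le C_\phi^2, \qquad \text{if } \gamma_t \kappa^2 \le \frac{1}{4L}. \tag{3.13} \]

To prove (3.13), we discuss the following two cases.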

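Case 1: $\phi'(s) < 0$ for any $s \in \mathbb{R}$. Firstly, consider $s \ge 0$. By the convexity of $\phi$, $-s\phi'(s) \le \phi(0) - \phi(s) \le \phi(0)$.
In addition, $\phi'(0) \le \phi'(s) < 0$. Hence, for $s \ge 0$, there holds

\[ 4(\phi'(s))^2 \gamma_t \kappa^2 - 2\phi'(s) s \le 4(\phi'(s))^2 \gamma_t \kappa^2 + 2\phi(0) \le \frac{(\phi'(0))^2}{L} + 2\phi(0). \tag{3.14} \]

Secondly, consider $s < 0$ which implies $s\phi'(0) > 0$. Since $\phi'(\cdot)$ is Lipschitz continuous, part
(c) of Proposition 1 implies that $(\phi'(s) - \phi'(0)) s \ge \frac{(\phi'(s) - \phi'(0))^2}{L} = \frac{(|\phi'(s)| - |\phi'(0)|)^2}{L}$. Therefore,
for $s < 0$, we have

\[
\begin{aligned}
4(\phi'(s))^2 \gamma_t \kappa^2 - 2\phi'(s) s &\le 4(\phi'(s))^2 \gamma_t \kappa^2 - 2(\phi'(s) - \phi'(0)) s \\
&\le 4(\phi'(s))^2 \gamma_t \kappa^2 - \frac{2(|\phi'(s)| - |\phi'(0)|)^2}{L} \\
&\le \frac{(\phi'(s))^2}{L} - \frac{2(|\phi'(s)| - |\phi'(0)|)^2}{L} \\
&= -\frac{(|\phi'(s)| - 2|\phi'(0)|)^2}{L} + \frac{2(\phi'(0))^2}{L} \le \frac{2(\phi'(0))^2}{L}.
\end{aligned}
\tag{3.15}
\]

Combining the above estimates (3.14) and (3.15) yields that

\[ \sup_{s\in\mathbb{R}} \big[ 4(\phi'(s))^2 \gamma_t \kappa^2 - 2\phi'(s) s \big] \le 2\phi(0) + \frac{2(\phi'(0))^2}{L}. \]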

Case 2: $\phi'(s_1) > 0$ for some $s_1 \in \mathbb{R}$. Since $\phi'$ is increasing and $\phi'(0) < 0$ by assumption,
$s_1$ must be positive and there exists $s_0 > 0$ such that $\phi'(s_0) = 0$. Hence, by part
(c) of Proposition 1, we have

\[
\begin{aligned}
4(\phi'(s))^2 \gamma_t \kappa^2 - 2\phi'(s) s &= 4(\phi'(s))^2 \gamma_t \kappa^2 - 2(\phi'(s) - \phi'(s_0))(s - s_0) - 2 s_0 \phi'(s) \\
&\le 4(\phi'(s))^2 \gamma_t \kappa^2 - \frac{2}{L}(\phi'(s) - \phi'(s_0))^2 - 2 s_0 \phi'(s) \\
&= \Big(4\gamma_t \kappa^2 - \frac{2}{L}\Big)(\phi'(s))^2 - 2 s_0 \phi'(s) \\
&\le -\frac{1}{L}(\phi'(s))^2 - 2 s_0 \phi'(s) = -\frac{1}{L}(\phi'(s) + L s_0)^2 + L (s_0)^2,
\end{aligned}
\]

which implies that

\[ \sup_{s\in\mathbb{R}} \big[ 4(\phi'(s))^2 \gamma_t \kappa^2 - 2\phi'(s) s \big] \le L (s_0)^2. \]

Combining the estimates in the above two cases yields (3.13). This completes the proof of
the lemma. □
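The claim $\sup_s [4(\phi'(s))^2\gamma_t\kappa^2 - 2\phi'(s)s] \le C_\phi^2$ can be checked numerically for a concrete loss. The sketch below uses the logistic loss, for which $\phi'(s) < 0$ everywhere (Case 1), $L = 1/4$, $\phi(0) = \log 2$ and $\phi'(0) = -1/2$, so $C_\phi^2 = 2\log 2 + 2$. The grid and the choice $\gamma_t\kappa^2 = \frac{1}{4L}$ are illustrative assumptions.

```python
import math

L = 0.25  # Lipschitz constant of the logistic loss gradient

def dphi(s):
    """Derivative of log(1 + exp(-s)), written to avoid overflow."""
    if s >= 0:
        e = math.exp(-s)
        return -e / (1.0 + e)
    return -1.0 / (1.0 + math.exp(s))

def claim_lhs(s, gamma_kappa2):
    """The quantity 4 * phi'(s)^2 * gamma_t * kappa^2 - 2 * phi'(s) * s."""
    return 4.0 * dphi(s) ** 2 * gamma_kappa2 - 2.0 * dphi(s) * s

# C_phi^2 in Case 1: 2 * phi(0) + 2 * phi'(0)^2 / L.
C_PHI_SQUARED = 2.0 * math.log(2.0) + 2.0 * dphi(0.0) ** 2 / L

def max_over_grid(gamma_kappa2, lo=-30.0, hi=30.0, steps=6000):
    """Largest value of claim_lhs on an evenly spaced grid of [lo, hi]."""
    width = (hi - lo) / steps
    return max(claim_lhs(lo + i * width, gamma_kappa2) for i in range(steps + 1))
```

With the largest admissible value $\gamma_t\kappa^2 = \frac{1}{4L} = 1$, the grid maximum stays well below $C_\phi^2 \approx 3.39$, consistent with the claim.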


From the above lemma, we know that if 7 1 = -^ with 6 € (0,1) then 

t -1 

\\ft\\K < 


N 


C x . £ ^ .V 


5=2 




(3.16) 


Our analysis for Algorithm 2 also needs the concept of Rademacher averages [5]. Let $\mathcal{F}$ be
a class of uniformly bounded functions. The (empirical) Rademacher average $R_n(\mathcal{F})$ over $\mathcal{F}$
is defined by

\[ R_n(\mathcal{F}) := \mathbb{E}_\sigma \Big[ \sup_{f\in\mathcal{F}} \frac{1}{n} \sum_{j=1}^{n} \sigma_j f(z_j) \Big], \]

where $\{z_j : j = 1, 2, \ldots, n\}$ are independent random variables distributed according to
some probability measure and $\{\sigma_j : j = 1, 2, \ldots, n\}$ are independent Rademacher random
variables, that is, $P(\sigma_j = 1) = P(\sigma_j = -1) = \frac{1}{2}$. Another useful complexity to describe the
capacity of $\mathcal{F}$ is the Gaussian average which is defined by

\[ G_n(\mathcal{F}) := \mathbb{E}_g \Big[ \sup_{f\in\mathcal{F}} \frac{1}{n} \sum_{j=1}^{n} g_j f(z_j) \Big], \]

where {gj : j = 1,2,... ,n} are independent Gaussian Af(0, 1) random variables. The fol¬ 
lowing inequality (e.g. m Remark 2.26]) describes the relationship between the above 
complexity averages: 

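\[ \frac{\underline{\mu}\, G_n(\mathcal{F})}{\ln n} \le R_n(\mathcal{F}) \le \mu\, G_n(\mathcal{F}). \tag{3.17} \]

Here, $\underline{\mu} > 0$ and $\mu > 0$ are absolute constants independent of $\mathcal{F}$ and $n$.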
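For a finite function class and a small sample, the empirical Rademacher average can be computed exactly by enumerating all $2^n$ sign vectors. The following sketch does this; the function name and the input convention (`values[k][j]` holds $f_k(z_j)$) are illustrative assumptions, and the enumeration is only practical for small $n$.

```python
from itertools import product

def empirical_rademacher(values):
    """Exact empirical Rademacher average of a finite class.

    values[k][j] is the value f_k(z_j) of the k-th function at the j-th sample.
    Averages sup_k (1/n) * sum_j sigma_j * f_k(z_j) over all 2**n sign vectors.
    """
    n = len(values[0])
    total = 0.0
    for signs in product((-1, 1), repeat=n):
        total += max(sum(s * v for s, v in zip(signs, row)) for row in values) / n
    return total / 2 ** n
```

A single function has Rademacher average 0, since the signs average out. The class $\{f, -f\}$ with $f \equiv 1$ on two points gives $\mathbb{E}|\sigma_1 + \sigma_2| / 2 = 0.5$.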


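We begin by stating the well-known comparison principle for Gaussian processes,
which will be used to prove a useful property of Gaussian averages.

Lemma 5. Let $\{X_\theta : \theta \in \Theta\}$ and $\{Y_\theta : \theta \in \Theta\}$ be two zero-mean Gaussian processes indexed
by the same countable set $\Theta$ and suppose that

\[ \mathbb{E}_g[(Y_\theta - Y_{\tilde{\theta}})^2] \le \mathbb{E}_g[(X_\theta - X_{\tilde{\theta}})^2], \qquad \forall\, \theta, \tilde{\theta} \in \Theta. \]

Then,

\[ \mathbb{E}_g\big[\sup_{\theta\in\Theta} Y_\theta\big] \le \mathbb{E}_g\big[\sup_{\theta\in\Theta} X_\theta\big]. \]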
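We now can derive the following property related to the Gaussian average.

Lemma 6. Let $F_j(\theta)$ be a set of functions indexed by parameters $\theta = (\theta_1, \theta_2) \in \Theta_1 \times \Theta_2$,
and let $H_j(\theta_1)$ and $J_j(\theta_2)$ be sets of functions indexed, respectively, by parameters $\theta_1 \in \Theta_1$ and
$\theta_2 \in \Theta_2$. Assume, for any $\theta = (\theta_1, \theta_2),\ \tilde{\theta} = (\tilde{\theta}_1, \tilde{\theta}_2) \in \Theta_1 \times \Theta_2$, that

\[ |F_j(\theta) - F_j(\tilde{\theta})|^2 \le |H_j(\theta_1) - H_j(\tilde{\theta}_1)|^2 + |J_j(\theta_2) - J_j(\tilde{\theta}_2)|^2. \]

Then,

\[
\mathbb{E}_g\Big[ \sup_{(\theta_1,\theta_2)\in\Theta_1\times\Theta_2} \sum_{j=1}^{n} g_j F_j(\theta) \Big]
\le \mathbb{E}_g\Big[ \sup_{\theta_1\in\Theta_1} \sum_{j=1}^{n} g_j H_j(\theta_1) \Big] + \mathbb{E}_g\Big[ \sup_{\theta_2\in\Theta_2} \sum_{j=1}^{n} g_j J_j(\theta_2) \Big].
\]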


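Proof. Let $g_1, \ldots, g_{2n}$ be $2n$ independent $\mathcal{N}(0,1)$ Gaussian variables. Introduce two Gaussian
processes:

\[ X_\theta = \sum_{j=1}^{n} g_j F_j(\theta) \quad\text{and}\quad Y_\theta = \sum_{j=1}^{n} \big[ g_j H_j(\theta_1) + g_{n+j} J_j(\theta_2) \big]. \]

Then, $\mathbb{E}_g[(X_\theta - X_{\tilde{\theta}})^2] = \sum_{j=1}^{n} [F_j(\theta) - F_j(\tilde{\theta})]^2$, and
$\mathbb{E}_g[(Y_\theta - Y_{\tilde{\theta}})^2] = \sum_{j=1}^{n} \big\{ (H_j(\theta_1) - H_j(\tilde{\theta}_1))^2 + (J_j(\theta_2) - J_j(\tilde{\theta}_2))^2 \big\}$, so that
$\mathbb{E}_g[(X_\theta - X_{\tilde{\theta}})^2] \le \mathbb{E}_g[(Y_\theta - Y_{\tilde{\theta}})^2]$ by assumption. According to Lemma 5 (applied with the roles of
$X$ and $Y$ exchanged), we have

\[
\begin{aligned}
\mathbb{E}_g\Big[ \sup_{\theta\in\Theta} \sum_{j=1}^{n} g_j F_j(\theta) \Big]
&\le \mathbb{E}_g\Big[ \sup_{\theta\in\Theta} \Big( \sum_{j=1}^{n} g_j H_j(\theta_1) + \sum_{j=1}^{n} g_{n+j} J_j(\theta_2) \Big) \Big] \\
&\le \mathbb{E}_g\Big[ \sup_{\theta_1\in\Theta_1} \sum_{j=1}^{n} g_j H_j(\theta_1) \Big] + \mathbb{E}_g\Big[ \sup_{\theta_2\in\Theta_2} \sum_{j=1}^{n} g_{n+j} J_j(\theta_2) \Big] \\
&= \mathbb{E}_g\Big[ \sup_{\theta_1\in\Theta_1} \sum_{j=1}^{n} g_j H_j(\theta_1) \Big] + \mathbb{E}_g\Big[ \sup_{\theta_2\in\Theta_2} \sum_{j=1}^{n} g_j J_j(\theta_2) \Big].
\end{aligned}
\]

This completes the proof of the lemma. □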


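Denote

\[ M_t^{\phi} = \sup_{|s| \le 2\kappa D_t} |\phi'(s)|. \tag{3.18} \]

We also need to bound the following term defined by

\[ \Delta_t := \nabla\mathcal{E}(f_t) - \frac{1}{t-1} \sum_{j=1}^{t-1} \int_{\mathcal{Z}} \phi'((y - y_j) f_t(x, x_j))\,(y - y_j)\, K_{(x, x_j)}\, d\rho(x, y), \]

where $\nabla\mathcal{E}(f_t)$ denotes the functional derivative of $\mathcal{E}(\cdot)$ at $f_t$ given by

\[ \nabla\mathcal{E}(f_t) = \iint_{\mathcal{Z}\times\mathcal{Z}} \phi'((y - y') f_t(x, x'))\,(y - y')\, K_{(x, x')}\, d\rho(z)\, d\rho(z'). \]

Using Lemma 6, we can prove the following estimation.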

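Lemma 7. Let $\phi$ be 1-activating, and $\{f_t : t = 1, \ldots, T + 1\}$ be given by Algorithm 2. If
$\gamma_t \kappa^2 \le \frac{1}{4L}$ then, for any $t \ge 2$,

\[ \mathbb{E}[\|\Delta_t\|_K] \le \frac{8\sqrt{2}\, \mu\, (L\kappa D_t + M_t^{\phi})\, \kappa}{\sqrt{t-1}}. \]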
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Proof. For any fixed z = {x,y), letting £f, g (z,Zj) = ^'{{y-y j )f{x,Xj)){y-y j )g{x,x j ). Since 
7t« 2 < Y by Lemma HI \\ft\\K < D t . Notice 

||At||i<: = sup [//0'((y - y){ft(x,x)))(y - y)g(x,x)dp{z)dp{z) 
llslk<i JJ 

-th £$=i Jz <^((y - yj)Mx,Xj))(y - y j )g{x,x j )dp(x,y )] 

< sup [ / / <//((y - y)(f(x,x)))(y - y)g(x,x)dp(z)dp(z) 

\\f\\ K <D t JJ (3.19) 

-J =1 £$=i J z <//((y - yj)f(x, Xj))(y - yj)g{x, xj)dp(x, y)] 

f 1 

= / sup [E g £ ft g (I, z) - -—- Y £ f>g (z, Zj )] dp{z). 

JZ \\f\\ K <D t * 1 ■ 

IIsIIk<i 

For any fixed z = (x, y) , by the standard symmetrization technique [5 > from the above 
inequality we have 


1 i" 1 

sup [E z£f,g(z, z) - —— Y €f,g( z , z 3 )] 

II K<Dt 1 1 j= 1 


It/ll K<Dt 
IMIk<i 

< 2E z E, a sup 


1 


t -1 


ll/ll<Ui t 1 
IIsIIk^ 1 j 




< 2/xE z E s sup 

ll/ll <-c 
llslljr<i 


1 


t -1 


/!<»/. i 1 j =1 




(3.20) 


Let 0i = {/ G Uk : ||/||x < A} and 0 2 = {g G : Hfflk < 1}- Then, for any /, / G 0i 
and € 0 2 , there holds 

I?/, s (£,z) ~ £f,g(z,z)\ 2 < {4:V2LH\f(x,Xj) - /(x,xj)|) 2 + (2V2M?\g(x,Xj) -g(x,xj )|) 2 

Applying Lemma [6] with Fi(0) = £f, g (z,z) with 9\ = /, 0 2 = 5, Hj(9\) = 4\/2 L7if(x,Xj), 
and JjiOf) = 2\/2 Dfg(x,Xj) yields that 


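$$\begin{aligned}
&\mathbb{E}_{\mathbf g}\sup_{\|f\|_K\le D_t,\ \|g\|_K\le 1}\frac{1}{t-1}\sum_{j=1}^{t-1}g_j\,\ell_{f,g}(z,z_j) \\
&\quad\le 4\sqrt{2}L\kappa\,\mathbb{E}_{\mathbf g}\Big[\sup_{\|f\|_K\le D_t}\frac{1}{t-1}\sum_{j=1}^{t-1}g_j f(x,x_j)\Big] + 2\sqrt{2}M_t^\phi\,\mathbb{E}_{\mathbf g}\Big[\sup_{\|g\|_K\le 1}\frac{1}{t-1}\sum_{j=1}^{t-1}g_j\,g(x,x_j)\Big] \\
&\quad= 4\sqrt{2}L\kappa\,\mathbb{E}_{\mathbf g}\sup_{\|f\|_K\le D_t}\Big\langle\frac{1}{t-1}\sum_{j=1}^{t-1}g_jK_{(x,x_j)},\,f\Big\rangle_K + 2\sqrt{2}M_t^\phi\,\mathbb{E}_{\mathbf g}\sup_{\|g\|_K\le 1}\Big\langle\frac{1}{t-1}\sum_{j=1}^{t-1}g_jK_{(x,x_j)},\,g\Big\rangle_K \\
&\quad\le 4\sqrt{2}L\kappa D_t\,\mathbb{E}_{\mathbf g}\Big\|\frac{1}{t-1}\sum_{j=1}^{t-1}g_jK_{(x,x_j)}\Big\|_K + 2\sqrt{2}M_t^\phi\,\mathbb{E}_{\mathbf g}\Big\|\frac{1}{t-1}\sum_{j=1}^{t-1}g_jK_{(x,x_j)}\Big\|_K \\
&\quad\le 4\sqrt{2}L\kappa D_t\Big(\mathbb{E}_{\mathbf g}\Big\|\frac{1}{t-1}\sum_{j=1}^{t-1}g_jK_{(x,x_j)}\Big\|_K^2\Big)^{1/2} + 2\sqrt{2}M_t^\phi\Big(\mathbb{E}_{\mathbf g}\Big\|\frac{1}{t-1}\sum_{j=1}^{t-1}g_jK_{(x,x_j)}\Big\|_K^2\Big)^{1/2} \\
&\quad\le \frac{4\sqrt{2}\,(L\kappa D_t + M_t^\phi)\,\kappa}{\sqrt{t-1}}.
\end{aligned}$$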


Putting the above estimate, (3.19), and (3.20) together yields the desired result. □
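As an informal numerical sanity check of the last two steps of this proof (a sketch, not part of the argument), the snippet below estimates $\mathbb{E}_{\mathbf g}\|\frac{1}{n}\sum_{j} g_j K_{p_j}\|_K$ by Monte Carlo and compares it with $\kappa/\sqrt{n}$, where $n$ plays the role of $t-1$. The Gaussian kernel, the random points `P`, and the helper names `gaussian_gram` and `gaussian_average_norm` are illustrative assumptions; for this kernel $\kappa = 1$.

```python
import numpy as np

def gaussian_gram(P, sigma=1.0):
    """Gram matrix G[i, j] = exp(-|P[i] - P[j]|^2 / (2 sigma^2)); sup K = 1, so kappa = 1."""
    sq = np.sum((P[:, None, :] - P[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def gaussian_average_norm(G, n_draws=20000, seed=0):
    """Monte Carlo estimate of E || (1/n) sum_j g_j K_{p_j} ||_K for standard Gaussian g_j.

    By the reproducing property, the squared RKHS norm equals g^T G g / n^2.
    """
    n = G.shape[0]
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((n_draws, n))
    sq_norms = np.einsum("di,ij,dj->d", g, G, g) / n ** 2
    return float(np.mean(np.sqrt(np.maximum(sq_norms, 0.0))))

rng = np.random.default_rng(1)
n = 20                                     # stands in for t - 1
P = rng.uniform(-1.0, 1.0, size=(n, 4))    # each row encodes one pair (x, x_j); hypothetical data
G = gaussian_gram(P)
estimate = gaussian_average_norm(G)
bound = 1.0 / np.sqrt(n)                   # kappa / sqrt(t - 1) with kappa = 1
second_moment = np.trace(G) / n ** 2       # exact value of E ||.||_K^2
print(estimate, bound)
```

The exact second moment is $\operatorname{tr}(G)/n^2 \le \kappa^2/n$, and Jensen's inequality then gives the first-moment bound used in the proof.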
Denote, for any $t \in \mathbb{N}$, $R_t = \mathbb{E}[\mathcal{E}(f_t) - \mathcal{E}(f_{\mathcal H})]$. We derive the following recursive inequality
for $R_t$, which is critical for proving Theorem 3.


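Lemma 8. Let $\phi$ be a 1-activating loss, and let $\{f_t : t = 1,\dots,T+1\}$ be given by Algorithm 2.
Then, for any $t \ge 2$,

$$R_{t+1} \le R_t - \gamma_t\frac{R_t^2}{E_t} + \frac{16\sqrt{2}\,\mu\,\kappa^2 M_t^\phi (L\kappa D_t + M_t^\phi)\,\gamma_t}{\sqrt{t-1}} + 4L\kappa^4\gamma_t^2 (M_t^\phi)^2. \tag{3.21}$$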


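Proof. By part (a) of Proposition 1, we have

$$\phi\big((y-\tilde y)f_{t+1}(x,\tilde x)\big) \le \phi\big((y-\tilde y)f_t(x,\tilde x)\big) + \big\langle \phi'\big((y-\tilde y)f_t(x,\tilde x)\big)(y-\tilde y)K_{(x,\tilde x)},\, f_{t+1} - f_t\big\rangle_K + 2L\,|f_{t+1}(x,\tilde x) - f_t(x,\tilde x)|^2.$$

Therefore, letting $A_t = \nabla\mathcal{E}(f_t) - \frac{1}{t-1}\sum_{j=1}^{t-1}\int_Z \phi'\big((y-y_j)f_t(x,x_j)\big)(y-y_j)K_{(x,x_j)}\,d\rho(x,y)$, we
know that $\mathbb{E}[\mathcal{E}(f_{t+1})]$ is bounded by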

~ E (V£(/t), I £$=i /jz </>'((y - Vj)ft{x, Xj)){y - yj)K [XjXj) dp(x,y)) K 
+^Tp(£$=i l^'((yt - yj)ft(x t , Xj))(y t - yj)|) 2 

< Eg(/*)] - 7*E[||V£(/ t )|&] + 7tE(VE(/ 4 ), A t ) A - 
+e[£*= 1 1 <j /( (y t - yj )f t (x t ,x j ))\ 2 J 

< ns(ft)] - 7t E [||V£(/t)||jf] + 7tE[||VT(/i)||x||Ai|U-] (3.22) 

+i fert E K=i ~ yj)ft( x t> x i)) I 2 ] 

+e[£ j-=i 1 0((vt - yj)ft{xt,xj ))\ 2 ] 

< Eg(/ t )] - 7iE[||VT(/t)|| 2 ,] + 2<Z'Y t M?E[\\A t \\ K \ 

I <P((yt - yi)/t(*t,*j))l 2 ] 


Notice

$$\mathbb{E}\Big[\frac{1}{t-1}\sum_{j=1}^{t-1}\big|\phi'\big((y_t-y_j)f_t(x_t,x_j)\big)\big|^2\Big] \le (M_t^\phi)^2. \tag{3.23}$$


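By the convexity of $\phi$, $\mathcal{E}(f_t) - \mathcal{E}(f_{\mathcal H}) \le \langle\nabla\mathcal{E}(f_t), f_t - f_{\mathcal H}\rangle_K$, which, combined with the Cauchy-Schwarz inequality and the earlier bound $\mathbb{E}[\|f_t - f_{\mathcal H}\|_K^2] \le E_t$,
implies that

$$\mathbb{E}\big[\|\nabla\mathcal{E}(f_t)\|_K^2\big] \ge \frac{\big(\mathbb{E}[\mathcal{E}(f_t) - \mathcal{E}(f_{\mathcal H})]\big)^2}{\mathbb{E}\big[\|f_t - f_{\mathcal H}\|_K^2\big]} \ge \frac{R_t^2}{E_t}.$$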


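Combining the above inequality, (3.22) and (3.23), and recalling that $R_t = \mathbb{E}[\mathcal{E}(f_t) - \mathcal{E}(f_{\mathcal H})]$,
we have

$$R_{t+1} \le R_t - \gamma_t\frac{R_t^2}{E_t} + \frac{16\sqrt{2}\,\mu\,\kappa^2 M_t^\phi (L\kappa D_t + M_t^\phi)\,\gamma_t}{\sqrt{t-1}} + 4L\kappa^4\gamma_t^2 (M_t^\phi)^2.$$

This completes the proof of the lemma. □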


From (3.9), in analogy to the proof of Theorem 1, one can easily see that a sufficient
condition to guarantee the convergence of $\mathbb{E}[\mathcal{E}(f_t)]$ to $\mathcal{E}(f_{\mathcal H})$ can be stated as follows:


OC 





+ 7 2 t(Mf) 


2 


< oo. 


(3.24) 
This sufficient condition is not as neat as its counterpart for the convergence of
Algorithm 1 given by Theorem 1. Observe that the randomized gradient
$\frac{1}{t-1}\sum_{j=1}^{t-1}\phi'\big((y_t-y_j)f_t(x_t,x_j)\big)(y_t-y_j)K_{(x_t,x_j)}$ in Algorithm 2 is not an unbiased estimator of the true gradient
$\iint_{Z\times Z}\phi'\big((y-y')f_t(x,x')\big)(y-y')K_{(x,x')}\,d\rho(x,y)\,d\rho(x',y')$, even conditioned on $\{z_1, z_2, \dots, z_{t-1}\}$.
This fact may partly explain why our techniques cannot derive a sufficient condition similar
to the one for Algorithm 1 stated in Theorem 1.
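For concreteness, the randomized gradient step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the logistic loss, the Gaussian kernel acting on pair differences $x - x'$, the toy data, and the names `phi_prime`, `pair_kernel`, `evaluate` and `online_pairwise` are all hypothetical choices made for the example.

```python
import numpy as np

def phi_prime(s):
    """Derivative of the logistic loss phi(s) = log(1 + exp(-s))."""
    return -1.0 / (1.0 + np.exp(s))

def pair_kernel(d1, d2, sigma=1.0):
    """Gaussian kernel on pair differences d = x - x' (an illustrative choice of K)."""
    return float(np.exp(-np.sum((d1 - d2) ** 2) / (2.0 * sigma ** 2)))

def evaluate(centers, coeffs, d):
    """f(x, x') = sum_i coeffs[i] * K(centers[i], x - x'), evaluated at d = x - x'."""
    return sum(a * pair_kernel(p, d) for a, p in zip(coeffs, centers))

def online_pairwise(X, y, c=0.5, theta=0.75):
    """Unregularized online pairwise updates with step size gamma_t = c * t**(-theta)."""
    centers, coeffs = [], []
    T = len(y)
    for t in range(2, T + 1):              # new example z_t, history z_1, ..., z_{t-1}
        gamma_t = c * t ** (-theta)
        x_t, y_t = X[t - 1], y[t - 1]
        updates = []
        for j in range(t - 1):
            w = y_t - y[j]
            if w == 0:
                continue                   # equal labels give a zero gradient term
            d = x_t - X[j]
            grad_coeff = phi_prime(w * evaluate(centers, coeffs, d)) * w
            updates.append((d, -gamma_t * grad_coeff / (t - 1)))
        for d, a in updates:               # applied after the loop so all terms use f_t
            centers.append(d)
            coeffs.append(a)
    return centers, coeffs

X = np.array([[1.0], [-1.0], [1.2], [-0.8], [0.9], [-1.1]])
y = np.array([1, -1, 1, -1, 1, -1])
centers, coeffs = online_pairwise(X, y)
score_pos = evaluate(centers, coeffs, np.array([2.0]))    # positive minus negative input
score_neg = evaluate(centers, coeffs, np.array([-2.0]))   # negative minus positive input
print(score_pos, score_neg)
```

Each step averages over the $t-1$ pairs formed with past examples, which is the source of the dependence on $\{z_1,\dots,z_{t-1}\}$ discussed above.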

Lemma 9. For any $x, \nu, a > 0$, there holds

$$a\ln x \le \nu x + a\ln\Big(\frac{a}{\nu e}\Big).$$

Proof. The lemma directly follows from the inequality in [21], i.e. $e^{-\nu x} \le \big(\frac{a}{\nu e}\big)^a x^{-a}$, by taking logarithms on both sides. □
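The inequality of Lemma 9 is easy to check numerically. The short sketch below evaluates the gap between the two sides on a grid; the function name `lemma9_gap` and the grid values are illustrative assumptions. Equality is attained at $x = a/\nu$, where $a\ln x - \nu x$ is maximal.

```python
import math

def lemma9_gap(x, nu, a):
    """Right-hand side minus left-hand side of a*ln(x) <= nu*x + a*ln(a/(nu*e))."""
    return nu * x + a * math.log(a / (nu * math.e)) - a * math.log(x)

grid = [0.01, 0.1, 0.5, 1.0, 2.0, 10.0, 100.0]
min_gap = min(lemma9_gap(x, nu, a) for x in grid for nu in grid for a in grid)
gap_at_maximiser = lemma9_gap(x=3.0 / 0.5, nu=0.5, a=3.0)   # x = a / nu
print(min_gap, gap_at_maximiser)
```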


We are now in a position to prove Theorem 3 by induction.

Proof of Theorem [3] Denote a% = ||/^||^ + 4<r|,, and for any 6 € (0, min(0 — 1 — 0)), 


let 


Now let 




$$D := C_{\theta,\delta,\mathcal H} = \max\{D_1, D_2, D_3\}, \tag{3.25}$$


where Di = f exp( ^^e-i )c W, 


tp+iyo+p) 

D 2=2 6 




-(1+32 k 4 L 2 )c 2 


) a H 


2 Lk 2 exp(-— 2g _ 1 

n 2 r~2 ( (l+32K 4 L 2 )c 2 1 a 

+ ln(2 


(20-l)(0+/3) 

2 a^La 2 exp( (1+32 ^f )c2 ) [In 2 + ^ ln( c 
(1 + 32 k 4 L 2 )c 2 


and 


D 5 = - exp( 
c 




e+i 


20-1 


) [d-H + —f In 1] + 4 k 2 (Lck 2 + 8V2/t)( * A+ D ' 1 ' r """ 2 


+ l+'(0)l) • 


Let to = |_2( 


cD 


2 exp( 


(1+32k- 


4 L 2 jg- —) 9+,3 _|- Since D > §exp( 


-) a n 


(1+32 k 4 L 2 )c 2 
20-1 


)a% and 0 + /3 < 1, we 


have to > 2. Notice 

= E[£(/ t ) - £(/«)] < 2LK 2 E(||/ to - / w |&) < 2L7?E to 
< 2 Lk 2 exp( (1+3 2e K „[ )c ) [a w + lnt 0 )] 

( 1+32k 4 L 2 )c 2 \ ^ , 2CT H^ 2ex p( (1+3 2fl!f )c2 ) Ini? 4« 


< 2 Lk 2 exp( ( 2 ^ 1 )c )a n + 

+ 2 ° h l k 2 ex p( (1+3 20_f )c2 ) P n 2 + 0^3 ln (4r)]- 


(3.26) 


Applying Lemma [9] with a = 

e 

and x = implies that 

2 ct^jLk? exp 


o 2 r~2 f (l+32£ 4 £ 2 )c 2 \ 

2 f r 2 / L K 2 ex p( / 2e-1 ) ^ = 2 _ i _ /3 


2 exp 


/ (l+32£ 4 L 2 )c 2 \ „ __ 

V 29-1 / a H \ 


( (l+32^ 4 £ 2 )c 2 1 e 

v_ 29-1 _ ) m nra 


+ ( 


2 ct^Lk? exp(- 


lnlW < 2 -i-^ 2exp 

) 


0 


■)« 


“« \ 9+<3 


) ■<*)] . 


D s +P 
Putting this estimation back into (3.26), we have, for any $t \le t_0$,


Rt 


< 2 LK 2 E to < 2 Lk 2 exp( ^ ) a H 


+2a^L k 2 exp( 


(1+32 k 4 L 2 )c ; 
20—1 

' (1+32S 4 L 2 )c 2 ' 


1(1+32 S 4 i 2 )c 2 \ 

-1-/3/ 2exp( 29,! Jan 


■)[ln2+^ln(^)] 


+2 


/ 2cr?, Lk 2 exp 
+(—**— 


( ! 




,8 

e+/3 


D e +P 


) [ '■'■$%$$ + ln(2^+ra^ L +r+„»«)] 


< 2~P 


-R /2exp 


0 


-) 


^V +/3 d^ <4<f, 

r o 


(3.27) 


where, in the third-to-last inequality, we have used the fact that $D \ge D_2$.

We can now prove the theorem by induction. Due to (3.27), $R_t \le D/t^\beta$ certainly holds true for
$t \le t_0$. Now assume $R_t \le D/t^\beta$ for some $t \ge t_0$.

To estimate $R_{t+1}$, note, by the assumption on $\phi$, that $M_t^\phi = \sup_{|s|\le 2\kappa D_t}|\phi'(s)| \le 2L\kappa D_t + |\phi'(0)|$,
and $\gamma_t \le c/\sqrt{t}$ since $\theta > 1/2$. The recursive inequality (3.21) becomes


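$$\begin{aligned}
R_{t+1} &\le R_t - \gamma_t\frac{R_t^2}{E_t} + \frac{32\sqrt{2}\,\mu\,\kappa^2 M_t^\phi(L\kappa D_t + M_t^\phi)\,\gamma_t}{\sqrt{t}} + \frac{4Lc\kappa^4 (M_t^\phi)^2\,\gamma_t}{\sqrt{t}} \\
&\le R_t - \gamma_t\frac{R_t^2}{E_t} + \frac{4\kappa^2(Lc\kappa^2 + 8\sqrt{2}\mu)\big(3L\kappa D_t + |\phi'(0)|\big)^2\gamma_t}{\sqrt{t}}.
\end{aligned} \tag{3.28}$$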


Consider the function T(x) = x — 7177 which is increasing if x € [0, ^7]. By the definition 
of to, it is also easy to verify, for any t > to, that 

D < 2 t e E t _ 2 E t 
t& ~ c 7 1 

Therefore, by recalling (13.161) . i.e. D t < £~, we have 


4k 2 (Lck 2 +8v / 2/i)(3Lk 2 D t +|</l'(0)|) 2 7t 

_Vt 


Rt+i <• F(Rt) + 

< F( D ) + 4 ^ 2 (-^ c ^ 2 + 8 V / 2A t )(3I'K 2 i5t + |0 , (O)|) 2 7t 


Vt?> 

< Dt~P - 


It 


D 2 t~ 2 P 

E t 


+ dgt 


ft 
(3.29)


where 

dg = 4 k 2 (Lck 2 + 8V2f(- — ~ + |</> / (0)|) 2 . 

yl — o' 

In addition, for any 0 < <5 < min(0 — 1 — 0), applying Lemma [9] with x = i 5 ,a = 1, and 

v = 5 implies that 

In t < t S + 7 In 7 < [7 In 7] t 5 . 

This yields that 


-Et < exp(- 


(1 + 32k 4 L 2 )c 2 r 

— ) [ “« + 


(7 


2 , (1 + 32k 4 L 2 )c 2 s 2<j|, 1 r r 

^lnt)] < exp(-- 2Q _ l )[a-H + —j? L Ri-]t 5 := bgjt 5 . 
From the above inequality and (3.29), and noticing that $\frac{1}{2} - \theta + 2\beta + \delta \le 0$ and $\theta + \beta + \delta \le 1$, we
have

Rt +1 < % [l - ^t~ e -P- s + ^~ 29+ P] 

= §[1 

(3 ' 30) 

- fE 1 “ (* + 1 ) X ] - (t+Ip’ 

where the last to fourth inequality used the fact that — jy > 1 since D > D 3 = = ^M-+dg > 
+ \]+ lbe ^ de ). This completes the proof of the theorem. □ 
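To build intuition for the induction, one can iterate a scalar recursion of the same shape as the one driving the proof: a quadratic decrement $\gamma_t r_t^2/E_t$ plus a small additive perturbation. The sketch below is only an illustration under assumptions: the constants `c`, `b`, `d`, the exponents, the stand-in $E_t = b\,t^{\delta}$, and the name `simulate_recursion` are hypothetical and are not the constants of the theorem.

```python
def simulate_recursion(T, c=0.5, theta=0.75, delta=0.05, b=1.0, d=1.0, r1=1.0):
    """Iterate r_{t+1} = r_t - gamma_t * r_t**2 / E_t + d * gamma_t * t**(0.5 - theta).

    gamma_t = c * t**(-theta) and E_t = b * t**delta are hypothetical stand-ins
    for the step size and for the bound on E||f_t - f_H||_K^2.
    """
    r = r1
    path = [r]
    for t in range(1, T):
        gamma_t = c * t ** (-theta)
        e_t = b * t ** delta
        r = r - gamma_t * r * r / e_t + d * gamma_t * t ** (0.5 - theta)
        path.append(r)
    return path  # path[t - 1] holds r_t

theta, delta = 0.75, 0.05
beta = (theta - 0.5 - delta) / 2.0   # largest beta with 1/2 - theta + 2*beta + delta <= 0
path = simulate_recursion(T=20000, theta=theta, delta=delta)
scaled = [r * (t ** beta) for t, r in enumerate(path, start=1)]
print(max(scaled), path[-1])
```

In this toy setting the rescaled sequence $r_t t^{\beta}$ stays bounded, which mirrors the form $R_t \le D t^{-\beta}$ of the induction hypothesis.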

We now turn our attention to the proof of Theorem 4.


Proof of Theorem [4J For any 5 £ (0, min(|, 1 — 9)), and let 

/3 = min(^-^,l -9-6). 

Let $D_1$, $D_2$ and $t_0$ be the same as those introduced in the proof of Theorem 3. Choose
$D := C_{\theta,\delta,\mathcal H} = \max\{D_1, D_2, D_3\}$, where

L >3 = - exp( ^ + L ^ C ) [an + In + 4 k 2 5(8\/2^Lk-^= + (8\/2^ + Lck 2 )B). 

Since 1(s) | < B for any s € K, Mf < B holds true uniformly. Hence, for any t < 
to = |2(- - ^4 2 - ) e+p I, there holds Rt < -7 < jf- Assume that, for some t > to, 

2 exp( ^ + 2 ”_ 1 ,c )a n t 0 

Rt < We will prove that Rt+i < by induction. To this end, observing that Mf < B 

holds true uniformly, we know from the recursive inequality (13.281) that 

r, ^ D Rt , 32 V2ttK 2 Mf{LTlD t +M?) lt 4,LcH A {Mt) 2 lt 

Rt+1 <Rt-itw t + -vr-+- Vt - 

p lif 4k 2 B [H\/2jiLh : Df,-\-(Ht/2jj+ 1, CK? n 

- K t it Et + ^ 


Recalling (13.161) again, i.e. D t < -^== t 2 , we have 

d / r / D \ 4:K 2 B[sV2fiLKD t +(8\/2fi+LcK 2 )B\'yt 

R t +1 - * ^ +-p- (3.31) 

< Dt -bd e t 2 

where 

dg = Ak 2 B{&V2hLk — + (8\/2/r + Lck 2 )B). 

\ \ — 9 

In analogy to the argument in the proof of Theorem [31 from the above inequality and (13.311) . 
and noticing — § + 2/3 + <5 < 0, # + /3 + <5 < 1, we have 




(3.32) 
where the last to fourth inequality used the fact, by the fact that D > _D 3 = + dg , which

means that > 1. This completes the proof of the theorem. □ 


4 Conclusion 


In this paper, we considered unregularized online learning algorithms in RKHSs for
both classification and pairwise learning problems associated with general loss functions. We
derived sufficient conditions on the step sizes to guarantee their convergence, and established
explicit convergence rates with polynomially decaying step sizes. This is in contrast to most
existing studies, which mainly focus on regularized online learning. Our novel
results are obtained by using tools from convex analysis, refined properties of Rademacher
averages and a smart induction approach. Below, we discuss some directions for future
work.

Firstly, the rates for Algorithm 1 and Algorithm 2 are suboptimal. For instance, in the special
case of the least-square loss, it was proved in [30] that Algorithm 1 can achieve $O(T^{-1/2}\ln T)$
if $f_\rho \in \mathcal{H}_K$. However, by Theorem 2, the rate is only of $O(T^{-1/3})$. It remains an open and
challenging question how to improve the rates for unregularized online learning algorithms
with general loss functions. Secondly, our main theorems assume that $g_{\mathcal H} = \arg\inf_{g\in\mathcal{H}_G}\mathcal{E}(g)$
and $f_{\mathcal H} = \arg\inf_{f\in\mathcal{H}_K}\mathcal{E}(f)$ exist. However, we know from [30, 33] that this assumption can
be removed for the least-square loss. It is clearly important future work to investigate whether
this assumption can also be removed for general loss functions. Thirdly, the techniques in
this paper rely on some smoothness assumptions on the loss function, and hence cannot handle
the popular hinge loss. It remains an open question to us how to establish the convergence
of unregularized online learning algorithms associated with the hinge loss. Lastly, our results
are established in expectation. It would be interesting to prove the almost sure
convergence of the last iterate of Algorithms 1 and 2.


Acknowledgements 


We would like to thank the referees for their invaluable comments and suggestions. We are 
also grateful to Dr. Yunwen Lei for pointing out a bug in the proof of Lemma 5 in an early 
version of the paper and providing Lemma 6 to us. The work by D. X. Zhou described in 
this paper is supported by a grant from the Research Grants Council of Hong Kong [Project 
No. CityU 105011]. 

References 

[1] S. Agarwal and P. Niyogi. Generalization bounds for ranking algorithms via algorithmic 
stability. Journal of Machine Learning Research, 10: 441-474, 2009.

[2] N. Aronszajn. Theory of reproducing kernels. Trans. Amer. Math. Soc. 68: 337-404, 
1950. 





[3] F. Bach and E. Moulines. Non-asymptotic analysis of stochastic approximation algorithms
for machine learning. Advances in Neural Information Processing Systems (NIPS),
2011.

[4] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk 
bounds. J. Amer. Statist. Assoc. 101: 138-156, 2006.

[5] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: risk bounds 
and structural results. Journal of Machine Learning Research, 3: 463-482, 2002.

[6] N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line 
learning algorithms. IEEE Trans. Inform. Theory, 50: 2050-2057, 2004. 

[7] B. Blanchard, G. Lugosi and N. Vayatis. On the rate of convergence of regularized boosting
classifiers. Journal of Machine Learning Research, 4: 861-894, 2004.

[8] Q. Cao, Z. C. Guo and Y. Ying. Generalization bounds for metric and similarity learning. 
arXiv preprint arXiv:1207.5437, 2012.

[9] N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line 
learning algorithms. IEEE Trans. Inform. Theory, 50: 2050-2057, 2004. 

[10] D. R. Chen, Q. Wu, Y. Ying and D. X. Zhou. Support vector machine soft margin 
classifiers: error analysis. Journal of Machine Learning Research, 5: 1143-1175, 2004. 

[11] S. Clemencon, G. Lugosi, and N. Vayatis. Ranking and empirical minimization of
U-statistics. The Annals of Statistics, 36: 844-874, 2008.

[12] F. Cucker and D.-X. Zhou. Learning Theory: An Approximation Theory Viewpoint. 
Cambridge University Press, 2007.

[13] E. De Vito, A. Caponnetto, and L. Rosasco. Model selection for regularized least-squares 
algorithm in learning theory. Found. Comput. Math., 5: 59-85, 2005. 

[14] P. Kar, B. K Sriperumbudur, P. Jain and H. C Karnick. On the generalization ability 
of online learning algorithms for pairwise loss functions. ICML, 2013. 

[15] Y. Lin. Support vector machines and the Bayes rule in classification. Data Mining and 
Knowledge Discovery, 6: 259-275, 2002. 

[16] S. Mukherjee and Q. Wu. Estimation of gradients and coordinate covariation in
classification. J. of Machine Learning Research, 7: 2481-2514, 2006.

[17] S. Mukherjee and D. X. Zhou. Learning coordinate covariances via gradients. J. of 
Machine Learning Research, 7: 519-549, 2006. 

[18] R. Meir and T. Zhang. Generalization error bounds for Bayesian mixture algorithms. 
Journal of Machine Learning Research, 4: 839-860, 2003. 

[19] W. Rejchel. On ranking and generalization bounds. J. of Machine Learning Research, 
13: 1373-1392, 2012. 

[20] S. Mendelson. A few notes on Statistical Learning Theory. In Advanced Lectures in 
Machine Learning, (S. Mendelson, A.J. Smola Eds), Lecture Notes in Computer Science 
(2600): 1-40, Springer 2003. 




[21] S. Smale and Y. Yao. Online learning algorithms. Found. Comp. Math., 6: 145-170, 
2006. 

[22] N. Srebro, K. Sridharan, and A. Tewari. Smoothness, low-noise, and fast rates. In 
Advances in Neural Information Processing Systems (NIPS), 2010. 

[23] I. Steinwart and A. Christmann. Support Vector Machines. Springer-Verlag, New York, 
2008. 

[24] P. Tarres and Y. Yao. Online learning as stochastic approximation of regularization 
paths: optimality and almost-sure convergence. IEEE Transactions on Information Theory,
9: 5716-5735, 2014.

[25] R. A. Vitale. Some comparisons for Gaussian processes. Proceedings of the American
Mathematical Society, pages 3043-3046, 2000. 

[26] K. Q. Weinberger and L. Saul. Distance metric learning for large margin nearest
neighbour classification. J. of Machine Learning Research, 10: 207-244, 2009.

[27] Y. Wang, R. Khardon, D. Pechyony, and R. Jones. Generalization bounds for online 
learning algorithms with pairwise loss functions. COLT, 2012. 

[28] Q. Wu, Y. Ying, and D. X. Zhou. Multi-kernel regularized classifiers. Journal of
Complexity, 23: 108-134, 2007.

[29] G. B. Ye and D. X. Zhou. Fully online classification by regularization. Appl. Comput.
Harmon. Anal., 23: 198-214, 2007. 

[30] Y. Ying and M. Pontil. Online gradient descent algorithms. Found. Comput. Math., 5: 
561-596, 2008. 

[31] Y. Ying, Q. Wu, and C. Campbell. Learning the coordinate gradients. Adv. Comput. 
Math., 37(3): 355-378, 2012. 

[32] Y. Ying and D. X. Zhou. Online regularized classification algorithms. IEEE Transactions
on Information Theory, 11: 4775-4788, 2006.

[33] Y. Ying and D. X. Zhou. Online pairwise learning algorithms with kernels. arXiv
preprint, http://arxiv.org/abs/1502.07229

[34] T. Zhang. Statistical behavior and consistency of classification methods based on convex 
risk minimization. Ann. Stat., 32: 56-85, 2004. 

[35] P. Zhao, S. C. H. Hoi, R. Jin and T. Yang. Online AUC Maximization. In Proceedings 
of the 28th International Conference on Machine Learning, 2011. 





